for education. In this thesis, we consider the problem of designing CPSGrader, an automatic
grader for laboratory-based courses in the area of cyber-physical systems. The work is
motivated by a UC Berkeley course in which students program a robot for specified navigation
tasks. Given a candidate student solution (control program for the robot), CPSGrader first
checks whether the robot performs the task correctly under a representative set of
environment conditions. If it does not, CPSGrader automatically generates feedback hinting
at possible errors in the program. CPSGrader is based on a novel notion of constrained
parameterized tests based on signal temporal logic (STL) that capture symptoms pointing
to success or causes of failure in traces obtained from a realistic simulator. We define
and solve the problem of synthesizing constraints on a parameterized test such that it is
consistent with a set of reference solutions with and without the desired symptom. We
also develop a clustering-based active learning technique that selects, from a large database
of unlabeled solutions, a small number of reference solutions for the expert to label. The
goal is to achieve better accuracy of fault identification with fewer reference solutions
than random selection. We demonstrate the effectiveness of CPSGrader using two
data sets: one obtained from an on-campus laboratory-based course at UC Berkeley, and
the other from a massive open online course (MOOC) offering. In addition, CPSGrader
was successfully deployed in the laboratory section of this MOOC on the edX platform.

Chapter 1


Introduction 

1.1  Formal  Methods  in  Education 

Massive open online courses (MOOCs) [1] and related technological advances promise to
bring world-class education to anyone with Internet access. Additionally, MOOCs present
a range of problems to which the field of formal methods has much to contribute. These
include automatic grading, automated exercise generation, and virtual laboratory
environments. In automatic grading, a computer program verifies that a candidate solution
provided by a student is “correct”, i.e., that it meets certain instructor-specified criteria (the
specification). In addition, and particularly when the solution is incorrect, the automatic
grader (henceforth, auto-grader) should provide feedback to the student as to where they
went wrong. Automatic exercise generation is the process of synthesizing problems (with
associated solutions) that test students’ understanding of course material, often starting
from instructor-provided sample problems. Finally, for courses involving laboratory
assignments, a virtual laboratory (henceforth, lab) seeks to provide the remote student with an
experience similar to that provided in a real, on-campus lab.

Lab-based courses that are not software-only pose a particular technical challenge. An
example of such a course is Introduction to Embedded Systems at UC Berkeley [2]. In this
course, students not only learn theoretical content on modeling, design, and analysis [3],
but also perform lab assignments on programming an embedded platform interfaced to a
mobile robot [4]. What would an online lab assignment in embedded systems look like?
In an ideal world, we would provide an infrastructure where students can log in remotely
to a computer which has been preconfigured with all development tools and laboratory
exercises; in fact, pilot projects exploring this approach have already been undertaken
(e.g., see [5]). However, in the MOOC setting, the large number of students makes such
a remotely accessible physical lab expensive and impractical. A virtual lab environment,
driven by simulation of real-world environments, appears to be the only solution at present.
For example, the MIT circuits course (MITx 6.002x) uses rudimentary circuit simulation
software [6].

In this thesis, we formalize the auto-grading problem for a virtual lab environment in the
field of embedded and cyber-physical systems (CPS). The virtual lab under consideration
is the one designed for EECS149.1x [7], a MOOC on Cyber-Physical Systems offered on the
edX platform, based on the aforementioned on-campus course. Next, we describe this
virtual laboratory in detail.

1.2  Target  Laboratory  Course 

The embedded systems laboratory course offered at the University of California, Berkeley
employs a custom mobile robotic platform called the Cal Climber [8], [9]. The Cal Climber
is based on the commercially available iRobot Create (derived from the iRobot Roomba
autonomous vacuum cleaner) (Fig. 1.1a), and the National Instruments myRIO embedded
controller. This off-the-shelf platform is capable of driving, sensing bumps and cliffs,
executing simple scripts, and communicating with an external controller. This configuration
demonstrates the composition of cyber-physical systems, where a robotics platform is
modeled as a sub-system and treated as a collection of sensors and actuators potentially located
beyond a network boundary. The problem statement centers on model-based design and is
given as follows (paraphrased from [4]):

Design a StateChart to drive the Cal Climber. On level ground, your robot should
drive straight. When an obstacle is encountered, such as a cliff or an object, your
robot should navigate around the object and continue in its original orientation.

On  an  incline,  your  robot  should  navigate  uphill,  while  still  avoiding  obstacles. 

Use  the  accelerometer  to  detect  an  incline  and  as  input  to  a  control  algorithm 
that  maintains  uphill  orientation. 

Source files distributed with the Cal Climber laboratory are structured such that
students only need to implement a function that receives as arguments the most recent values
of the accelerometer and robot sensors and returns desired wheel speeds. This function is
called repeatedly at short regular intervals of time (60 ms in our case) with the most recent
sensor and accelerometer data. Students implement this function for controlling the Cal
Climber. In the on-campus course, students first prototype their controller to work within
a simulated environment (without any auto-grading) based on the LabVIEW Robotics
Environment Simulator by National Instruments. The simulator is based on the Open
Dynamics Engine [10] rigid body dynamics software that can simulate robots in a virtual
environment (Fig. 1.1b). In EECS149.1x, the aforementioned online version of the course,
the same simulator, extended with the auto-grader described in this thesis, has been
used (Fig. 1.1c).

We refer to the functions implemented by students as solutions or controllers. A solution
is evaluated in a collection of environments against a collection of goal and fault properties,
forming test benches (a notion formalized in the following sections). For this purpose, the
simulator allows customized environments (with walls, objects, obstacles, ramps,
etc.) to be described in XML files, and we further extended its API to facilitate the export of
simulation traces to external property monitoring tools.

1.3  Problem  Motivation 

As mentioned before, in this thesis we tackle the problem of auto-grading a CPS lab.
Auto-grading is a verification and debugging problem where the objective is to
check whether a student’s solution meets the desired goals and also to provide feedback
pointing to possible causes of failure. The main point we make here is that the dynamical model
for the virtual lab is so complex that simulation is currently the only verification method
that can be practically employed. Thus, the auto-grader is based on simulation-based
verification. The high-level approach, previously hinted at in a position paper [11], is as follows.
Correctness properties are formalized in signal temporal logic (STL) [12]. Simulation test
benches are created by a combination of manual environment setup and simulation-based
falsification implemented in tools such as Breach [13]. For each lab assignment, there is
an end-to-end correctness property, hereafter referred to as the goal property. If the goal
is satisfied, the student solution (hereafter referred to as a controller) is deemed correct.
Otherwise, it is incorrect, and more analysis must be performed to identify the mistake
(fault) and provide feedback. This latter analysis is based on monitoring simulation traces
of the student controller on a library of known faults, also formalized in STL. If any of
these “fault properties” hold for a student controller, they are provided to the student as
feedback.

Figure 1.1: (a) Cal Climber laboratory platform, (b) Cal Climber in the LabVIEW Robotics
Environment Simulator, (c) Simulator with auto-grading functionality used in EECS149.1x

This approach, though straightforward on the surface, requires further technical
advances to be effective. The first problem is that the STL properties that encode both
goal and fault properties reference parameters that can vary over the set of environments
and student controllers; in fact, such variation must be allowed. For example, in a real
lab, students may program robots to move at different velocities while performing obstacle
avoidance. If the goal of the lab is only to correctly avoid an obstacle, the speed at which
the robot does so is irrelevant. However, given the variations in the controllers students design,
setting a reasonable range for parameters such as time or velocity in STL properties can be
tricky. Similarly, environments can also be parametric (for example, the location of
obstacles) and tests should be synthesized in a manner that accounts for these variations. Thus,
an effective approach to auto-grading CPS labs requires one to solve a certain parameter
synthesis problem.

We formalize this parameter synthesis problem and give an algorithm to solve it. First,
we define the notion of a parametrized test, which is a combination of a parametrized
environment and a parametrized STL (PSTL) property. A parametrized test is thus a collection
of tests. However, as discussed above, one needs to impose a constraint on this collection
to capture “legal” variations in student solutions. Such a constraint, termed a sub-domain,
defines the allowed set of parameter valuations. However, manually computing this
sub-domain is tedious and error-prone. We therefore give an algorithmic approach to synthesize
the sub-domain from reference controllers that should/should not pass the test bench. In
practice, it is easier for instructors to provide such reference controllers than it is to
manually compute sub-domains. In machine learning terminology, this can be thought of as
the training phase. The resulting constrained parameterized test bench then becomes the
“specification” that determines whether a student solution is correct, and, if not, which
fault is present. In machine learning terminology, this would be the classification phase.
Further, we identify a property, monotonicity, under which we can efficiently compute the
sub-domain, and which holds for the lab of interest.

Another issue with this approach is the availability of “positive” and “negative”
reference controllers. An instructor has to manually watch the simulation video to decide
whether a particular controller is good or bad before it can be used for training the test
bench. In essence, labeling of controllers is an expensive manual process. We formulate
the problem of obtaining labeled data as an active learning problem. We give a
clustering-based active learning methodology that takes as input a large set of unlabeled controllers
collected over various stages of development and chooses the controllers that an instructor
should label to get high accuracy of classification with fewer training examples
than random selection. Since the simulation traces are timed sequences of
multidimensional variables that capture environment and state data, we choose dynamic time
warping distance [14] as a measure of dissimilarity between controller behaviors and use it
for clustering.

We believe clustering is useful because, among many student solutions, the total
number of solution strategies is still small. Furthermore, any two solutions that follow the same
strategy will likely have the same set of faults present or absent. Hence, if some clustering
technique can identify each strategy as a separate cluster, then choosing one example from
each cluster should yield a training set with good coverage, and the synthesized test
bench will have high accuracy.

Any auto-grader must have at least two desirable properties: accuracy and efficiency.
The former means that the auto-grader must correctly classify right and wrong student
solutions, and for wrong solutions, correctly explain the mistake (fault). The latter means
that it must run efficiently in practice. For efficiency, we show how monotonicity can be
exploited again to avoid the need to run the entire constrained parametric test bench.
Instead, we define the notion of an adequate test sample and show that it is much smaller in
practice than the entire constrained test bench. We also provide an experimental evaluation
on the on-campus lab demonstrating that our approach is both accurate and efficient in
practice. We also test our active learning approach and show that selection of training
examples based on clustering leads to higher accuracy of classification as compared to
random selection of the same number of training examples, and therefore it can lead to
reduced overhead for instructors in providing labeled data.

1.4  Contributions 

To  summarize,  the  main  novel  contributions  of  this  work  are: 

• A formalization of the auto-grading problem for simulation-based virtual laboratories
in cyber-physical systems,

• A formalization of the problem of synthesizing a constrained parametric test bench
for the auto-grader along with an efficient solution approach based on monotonicity,

• A novel clustering-based active learning approach to aid in generation of labeled
training data for the synthesis algorithm, and

• An empirical evaluation demonstrating the accuracy and efficiency of CPSGrader,
the auto-grader for the on-campus embedded systems lab, and also the effectiveness
of the clustering-based active learning, on a database of student solutions from: (1)
the on-campus offering of the course EECS 149, and (2) the online edition of the same
course on edX (EECS149.1x).

Note that a large part of this thesis is based on the EMSOFT 2014 paper [15], joint work
with Alexandre Donze, Jeff C. Jensen, and Sanjit A. Seshia.

1.5  Related  Work 

There is a growing number of efforts to incorporate formal methods into technologies for
education. Singh et al. [16] present an approach to automatically generate problems in
high-school algebra. Sadigh et al. [17] show how the problem of generating variants of exercises in
an Embedded Systems textbook [3] can be mapped to standard problems in formal methods
and apply some of these methods to classes of exercises. Singh et al. [18] present an
auto-grader for a Python programming course, where, similar to this thesis, feedback is
generated based on a library of common mistakes, but, differently, the technical approach
uses an encoding to SAT-based program synthesis. Alur et al. [19] consider auto-grading
DFA construction problems, providing a novel blend of three techniques for assigning partial
grades for incorrect answers. This thesis proposes different formalisms and algorithms, and
represents the first auto-grader for lab assignments in the area of embedded, cyber-physical
systems.

Related  work  on  parameter  synthesis  for  temporal  logic  and  use  of  clustering  for  active 
learning  is  covered  in  later  chapters. 

The outline of the thesis is as follows. We introduce basic terminology and background
results in Ch. 2. In Ch. 3, we describe the main theoretical contributions, including our
formalization of the problem of synthesis of test benches and solution approach. In Ch. 4, we
describe the clustering-based active learning approach that serves as an aid to the synthesis
algorithm. Experimental results are given in Ch. 5. We conclude with future directions in
Ch. 6.


Chapter  2 


Background 

2.1  Signals,  Controllers,  and  Environments 

Definition 1 (Signal) A (uni-dimensional) signal is a function mapping the time domain
$\mathbb{T} = \mathbb{R}_{\geq 0}$ to the reals $\mathbb{R}$.

Boolean signals, used to represent discrete dynamics, are signals whose values are
restricted to false (denoted $\bot$) and true (denoted $\top$). Vectors in $\mathbb{R}^n$ with $n > 1$ are denoted
in bold fonts and their components are indexed from 1 to $n$, for example, $\mathbf{p} = (p_1, \cdots, p_n)$.
Likewise, a multi-dimensional signal $\mathbf{x}$ is a function from $\mathbb{T}$ to $\mathbb{R}^n$ such that $\forall t \in \mathbb{T}$,
$\mathbf{x}(t) = (x_1(t), \cdots, x_n(t))$. We will use the term “signal” to also refer to multi-dimensional
signals.

Definition 2 (Controller) A controller $C$ is a (deterministic) dynamical system that takes
as input a signal $\mathbf{y}(t)$ and computes an output signal $\mathbf{u}(t)$. It is common to drop time, and
say $\mathbf{u} = C(\mathbf{y})$.

Note that we make no assumption about how a controller computes its output. A
controller can have discrete or continuous dynamics or it can be a hybrid system. As
an example, a program running on the Cal Climber is a controller that takes bump and
cliff sensor signals, and accelerometer data as input $\mathbf{y}(t) = (\mathit{bump}(t), \mathit{cliff}(t), \mathit{accel}(t))$, and
responds with the desired left and right wheel speeds as output $\mathbf{u}(t) = (\mathit{lws}(t), \mathit{rws}(t))$.

Definition 3 (Environment) An environment $E$ for a controller $C$ is a dynamical system
generating all inputs to $C$.

As before, we make no assumptions about the form of the environment. All we assume
is the existence of a simulator that can take representations of $E$ and $C$, compose them,
and produce execution traces of the composite system. In other words, the simulator is an
oracle that gives semantics to the composite system $E \| C$.

We only consider deterministic environments, i.e., the composition of a controller and
an environment has deterministic behavior. For example, an arena composed of obstacles
and hills on level ground is an environment for the Cal Climber controller. Formally, a
trace $\mathrm{sim}(C, E)$ is a multi-dimensional signal $(\mathbf{x}(t), \mathbf{y}(t), \mathbf{u}(t))$ consisting of the inputs $\mathbf{y}$
and outputs $\mathbf{u}$ of the controller and optionally other signals $\mathbf{x}$ regarding the state of the
environment. For example, the position and orientation (in the plane of the ground) of
the robot in the arena $\mathbf{x}(t) = (\mathit{pos}(t), \mathit{angle}(t))$ are a part of the observable environment
state. By varying the environment, or the property being verified on the composition (see
Sec. 2.2), the instructor can test different features of the controller.
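
To make the composition concrete, the following is a minimal sketch of how a simulator might pair a controller with an environment and record a trace. The one-dimensional arena, the class and function names, and the speed values are all assumptions made for illustration; they are not the LabVIEW simulator or its API. Only the 60 ms controller period is taken from the lab setup described earlier.

```python
DT = 0.06  # controller period in seconds (60 ms, as in the lab setup)

class ToyEnvironment:
    """Hypothetical 1-D arena: the robot moves along y toward a wall."""
    def __init__(self, start_y=0.0, wall_y=1.0):
        self.pos_y = start_y
        self.wall_y = wall_y

    def sense(self):
        # Controller input y(t): here only a bump flag.
        return {"bump": self.pos_y >= self.wall_y}

    def advance(self, u):
        # Integrate the mean commanded wheel speed (m/s); never pass the wall.
        speed = 0.5 * (u["lws"] + u["rws"])
        self.pos_y = min(self.pos_y + speed * DT, self.wall_y)

def controller(y):
    """Stateless toy controller: drive forward, reverse while bumped."""
    s = -0.1 if y["bump"] else 0.2
    return {"lws": s, "rws": s}

def sim(controller, env, steps):
    """Compose controller and environment; return the sampled trace (x, y, u)."""
    trace = []
    for k in range(steps):
        y = env.sense()
        u = controller(y)
        trace.append({"t": k * DT, "pos_y": env.pos_y, **y, **u})
        env.advance(u)
    return trace

trace = sim(controller, ToyEnvironment(), 200)
```

Because both components are deterministic, running `sim` twice on fresh copies of the same environment yields identical traces, which is the property the text relies on.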

2.2  Signal  Temporal  Logic 

Since propositional (linear) temporal logic was introduced by Amir Pnueli [20], many variants
have been proposed. Temporal logics to reason about real-time signals, such as Timed
Propositional Temporal Logic [21] and Metric Temporal Logic (MTL) [22], were introduced
later to deal with dense-time signals. More recently, Signal Temporal Logic [12] was
proposed in the context of analog and mixed-signal circuits to deal with dense-time signals
taking values over both discrete and continuous domains. We use STL as the specification
language for the Embedded Systems lab assignment. Goals that the robotic controller must
achieve are expressed as STL properties.

The primitive constraints, or predicates, in STL take the form $\mu = f(\mathbf{x}) \sim \pi$, where
$f$ is a scalar-valued function over the signal $\mathbf{x}$, $\sim\, \in \{<, \leq, \geq, >, =, \neq\}$, and $\pi$ is a real
number. Temporal formulas are formed using temporal operators, “always” (denoted as
$\Box$), “eventually” (denoted as $\Diamond$) and “until” (denoted as $\mathbf{U}$). Each temporal operator is
indexed by intervals of the form $(a, b)$, $(a, b]$, $[a, b)$, $[a, b]$, $(a, \infty)$ or $[a, \infty)$ where each of $a$, $b$
is a non-negative real-valued constant. If $I$ is an interval, then an STL formula is written
using the grammar:

$\varphi := \top \mid \mu \mid \neg\varphi \mid \varphi_1 \wedge \varphi_2 \mid \varphi_1\, \mathbf{U}_I\, \varphi_2$

The always and eventually operators are defined as special cases of the until operator in
the standard way: $\Box_I \varphi = \neg \Diamond_I \neg \varphi$, $\Diamond_I \varphi = \top\, \mathbf{U}_I\, \varphi$. When the interval $I$ is omitted, we
use the default interval of $[0, +\infty)$. The semantics of STL formulas are defined informally
as follows. The signal $\mathbf{x}$ satisfies $f(\mathbf{x}) > 10$ at time $t$ (where $t \geq 0$) if $f(\mathbf{x}(t)) > 10$. It
satisfies $\varphi = \Box_{[0,2)} (x > -1)$ if for all time $0 \leq t < 2$, $x(t) > -1$. The signal $x_1$ satisfies
$\varphi = \Diamond_{[1,2)}\, x_1 > 0.4$ iff there exists time $t$ such that $1 \leq t < 2$ and $x_1(t) > 0.4$. The
two-dimensional signal $\mathbf{x} = (x_1, x_2)$ satisfies the formula $\varphi = (x_1 > 10)\, \mathbf{U}_{[2.3,4.5]}\, (x_2 < 1)$
iff there is some time $u$ where $2.3 \leq u \leq 4.5$ and $x_2(u) < 1$, and for all time $v$ in $[2.3, u)$,
$x_1(v)$ is greater than 10. The formal semantics of STL can be found in [12] and is given in
Appendix A.
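
The informal semantics above can be approximated on sampled traces. The sketch below is an illustration only: it evaluates formulas at time 0 over a finite list of samples rather than over dense time, it follows the informal reading given in the text (including the range $[a, u)$ for until), and the function names are hypothetical. It is not the monitoring algorithm of [12] or of Breach.

```python
import math

def always(times, values, pred, a=0.0, b=math.inf):
    """Box_[a,b) pred: pred holds at every sample time t with a <= t < b."""
    return all(pred(v) for t, v in zip(times, values) if a <= t < b)

def eventually(times, values, pred, a=0.0, b=math.inf):
    """Diamond_[a,b) pred: pred holds at some sample time t with a <= t < b."""
    return any(pred(v) for t, v in zip(times, values) if a <= t < b)

def until(times, values, pred1, pred2, a, b):
    """pred1 U_[a,b] pred2: some sample time u in [a, b] satisfies pred2,
    and pred1 holds at every sample time v in [a, u)."""
    for u, x_u in zip(times, values):
        if a <= u <= b and pred2(x_u):
            if all(pred1(x_v) for v, x_v in zip(times, values) if a <= v < u):
                return True
    return False

times = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
x = [0.0, 0.1, 0.2, 0.5, 0.3, 0.0, 0.0]

ok_always = always(times, x, lambda v: v > -1, 0, 2)           # Box_[0,2) (x > -1)
ok_eventually = eventually(times, x, lambda v: v > 0.4, 1, 2)  # Diamond_[1,2) (x > 0.4)

# Two-dimensional signal (x1, x2) for (x1 > 10) U_[2.3,4.5] (x2 < 1).
xy = [(11, 5), (11, 5), (11, 5), (11, 5), (11, 5), (11, 5), (11, 0)]
ok_until = until(times, xy, lambda p: p[0] > 10, lambda p: p[1] < 1, 2.3, 4.5)
```

A sampled check like this can miss violations that occur between samples, so it should be read as an approximation of the dense-time semantics.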

Parametric Signal Temporal Logic (PSTL) is an extension of STL introduced in [23] to
define template formulas containing unknown parameters. Syntactically speaking, a PSTL
formula is an STL formula where numeric constants, either in the constraints given by the
predicates $\mu$ or in the time intervals of the temporal operators, can be replaced by symbolic
parameters.

An STL formula is obtained by pairing a PSTL formula with a valuation function that
assigns a value to each symbolic parameter. For example, consider the PSTL formula
$\varphi(\pi, \tau) = \Box_{[0,\tau]} (x > \pi)$ with symbolic parameters $\pi$ (scale) and $\tau$ (time). The STL formula
$\Box_{[0,10]}\, x > 1.2$ is an instance of $\varphi$ obtained with the valuation $v = \{\tau \mapsto 10, \pi \mapsto 1.2\}$.
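
As an illustration, a PSTL template can be viewed as a function from a valuation to a concrete checker. The sketch below instantiates the template $\Box_{[0,\tau]} (x > \pi)$ on a sampled signal; the function name and the dictionary keys `"pi"` and `"tau"` are assumptions made for this example.

```python
def make_always_above(valuation):
    """Instantiate Box_[0,tau] (x > pi) with a concrete valuation.

    Returns a checker over a sampled signal given as (times, xs)."""
    pi, tau = valuation["pi"], valuation["tau"]

    def formula(times, xs):
        # Only samples whose time lies in [0, tau] are constrained.
        return all(x > pi for t, x in zip(times, xs) if 0 <= t <= tau)

    return formula

times = list(range(13))               # sample times 0, 1, ..., 12
xs = [2.0] * 11 + [0.0, 0.0]          # the signal drops to 0 after t = 10

phi = make_always_above({"tau": 10, "pi": 1.2})   # Box_[0,10] (x > 1.2)
holds = phi(times, xs)
```

For this particular template, raising either $\pi$ or $\tau$ can only turn a satisfied instance into a violated one, never the reverse; this is a simple instance of the kind of monotonic dependence on parameters mentioned in Ch. 1.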

2.3 Defects and Faults


A  controller  is  usually  designed  to  meet  certain  goals.  For  example,  the  Cal  Climber 
controller  should  be  able  to  navigate  around  obstacles  and  climb  hills.  To  talk  about  grading 
and  feedback  generation,  we  introduce  some  relevant  terminology  from  the  fault  testing  and 
diagnosis  literature. 

Definition  4  (Defect,  symptom  and  fault)  Given  a  controller  and  an  environment  with 
some  desired  goals, 

•  A  defect  is  a  bug  in  the  controller  implementation  that  leads  to  failure  in  meeting  goals; 

• A symptom is an interesting pattern in a simulation trace of the controller-environment
composition that can be characterized, for example, using STL, and

•  A  fault  is  a  symptom  that  is  present  in  a  trace  as  a  result  of  some  defect  in  the  controller. 

A general symptom, such as the inability to meet an end-to-end correctness goal (for
example, obstacle avoidance), is a fault that could be the result of multiple defects in the
controller. On the other hand, certain specific faults could be mapped to specific kinds
of defects. As an example, consider an obstacle avoidance strategy for the Cal Climber
controller, implemented in a language like C. The strategy states that every time the bump
sensor signal indicates a bump, the robot backs up, moves some distance to either right
or left and then re-orients by turning in place until the heading direction is the same as the
original direction $\mathit{angle}_0$. A controller will check the guard $|\mathit{angle}(t) - \mathit{angle}_0| < \epsilon$ for some
small $\epsilon > 0$ to determine when to stop turning in the re-orientation mode. A defect can
be introduced by replacing this guard by the exact equality check $\mathit{angle}(t) == \mathit{angle}_0$. This
modification usually leads to failure in practice, because the controller implementation polls
its sensors at certain intervals, and therefore, it is highly unlikely that the sensor value at
some polled time $t$, $\mathit{angle}(t)$, will be exactly $\mathit{angle}_0$. The fault resulting from this defect is
that in the re-orientation mode, the robot keeps turning in place, making full circles
multiple times. We call this the circle fault and will revisit it later in this thesis.
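
The effect of this defect can be reproduced in a few lines. The sketch below is a toy model, not the lab code: the turn rate, the tolerance, and all function names are assumptions, and only the 60 ms polling period comes from the lab setup. The robot turns in place and the guard is checked once per poll.

```python
DT = 0.06          # polling period in seconds (60 ms, from the lab setup)
TURN_RATE = 50.0   # assumed turn rate in degrees per second (3 degrees per poll)

def wrap(angle):
    """Wrap an angle in degrees to the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0

def exact_guard(angle, target):
    """Defective guard: exact equality on a polled sensor value."""
    return angle == target

def tolerant_guard(angle, target, eps=2.0):
    """Intended guard: |angle - target| < eps."""
    return abs(wrap(angle - target)) < eps

def polls_until_stop(guard, start, target, max_polls=1000):
    """Turn in place; return the poll index at which the guard fires, else None."""
    angle = start
    for k in range(max_polls):
        if guard(angle, target):
            return k
        angle = wrap(angle + TURN_RATE * DT)
    return None

stops_tolerant = polls_until_stop(tolerant_guard, start=-91.0, target=0.0)
stops_exact = polls_until_stop(exact_guard, start=-91.0, target=0.0)
```

With these assumed numbers the polled headings step over the target without ever landing on it, so the exact guard never fires within the polling budget and the simulated robot keeps circling, while the tolerant guard stops the turn on the first poll that falls within the tolerance.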

The ability to distinguish traces that exhibit a fault from those that do not is important
for auto-grading. Using this classification, we can not only separate correct solutions from
incorrect ones but also generate diagnostic feedback for failed traces by monitoring for
relevant faults that will likely correspond to known defects.

2.4  Dynamic  Time  Warping  Distance  (DTW) 

In time series analysis, dynamic time warping (DTW) [24] is an algorithm for measuring
similarity between two temporal sequences which may vary in time or speed. For instance,
similarities in movement patterns of two Cal Climber controllers could be detected using
DTW, even if one was moving at a faster wheel speed than the other, or if the matching
sub-patterns occur at different absolute times. DTW has been applied to temporal
sequences of video, audio, and graphics data. In general, DTW is a method that calculates
an optimal match between two given sequences (e.g., time series) with certain restrictions.
The sequences are “warped” non-linearly in the time dimension to compute the optimal
sequence alignment for which the two sequences match closely. Hence, the distance between
two sequences is largely insensitive to shifting and stretching along the time axis, making
DTW suitable for our purpose. DTW can be extended to multi-dimensional timed sequences [14].
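
For reference, the textbook dynamic-programming formulation of DTW is sketched below. This is the basic unconstrained variant with a pluggable point-wise distance; it is an illustration and not necessarily the exact variant or implementation of [14] used in this work.

```python
import math

def dtw_distance(a, b, dist=lambda p, q: abs(p - q)):
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    cost[i][j] is the cheapest alignment of a[:i] with b[:j]."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(a[i - 1], b[j - 1])
            # Extend the best of: advance in a only, in b only, or in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

slow = [0, 0, 1, 1, 2, 2, 1, 1, 0, 0]   # same shape, traversed at half speed
fast = [0, 1, 2, 1, 0]
d_same_shape = dtw_distance(slow, fast)
```

For multi-dimensional samples, a point-wise metric such as `math.dist` (Euclidean distance, Python 3.8+) can be passed as `dist`.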

2.5  Density-Based  Spatial  Clustering  (DBSCAN) 

Density-based spatial clustering (DBSCAN) [25] clusters samples based on a user-provided
estimate of the density of cluster nodes. It can take as input pre-computed pairwise
distances between samples and does not need the feature vectors to be given explicitly.
The number of clusters does not need to be specified in advance. DBSCAN starts off by
finding small groups of points that are very close to each other and marks these groups as
potential clusters. It then expands each cluster by including other close neighbours. It can
find arbitrarily shaped clusters and is robust to outliers. These features make it a good fit
for our application.
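
A compact sketch of DBSCAN over a pre-computed distance matrix is given below, together with a helper that picks one member per cluster as a candidate for labeling. The parameter names `eps` and `min_samples`, the convention that a point counts as its own neighbour, and the choice of the first member as the representative are assumptions made for this illustration; the selection strategy actually used is described in Ch. 4.

```python
def dbscan(dist, eps, min_samples):
    """Cluster points given a square matrix of pairwise distances.

    Returns one label per point: 0, 1, ... for clusters and -1 for noise."""
    n = len(dist)
    labels = [None] * n          # None means "not visited yet"

    def neighbours(i):
        return [j for j in range(n) if dist[i][j] <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1       # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster      # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbours(j)
            if len(nj) >= min_samples:   # j is a core point: keep expanding
                queue.extend(nj)
    return labels

def one_per_cluster(labels):
    """Pick the first member of each cluster as the item to hand-label."""
    picks = {}
    for i, lab in enumerate(labels):
        if lab != -1:
            picks.setdefault(lab, i)
    return sorted(picks.values())

points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 20.0]
dist = [[abs(p - q) for q in points] for p in points]
labels = dbscan(dist, eps=0.15, min_samples=2)
```

In the setting of this thesis the matrix `dist` would hold pairwise DTW distances between simulation traces rather than distances between numbers.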

Chapter 3


Synthesis  of  Test  Benches 


In  this  chapter,  we  formally  define  the  auto-grading  problem,  the  technical  challenge  in 
synthesizing  a  constrained  parametrized  set  of  tests,  and  our  approach  to  solve  this  problem. 

For the purpose of examples in this chapter, we always assume the controller is a Cal
Climber program and the environment is an arena with one robot, multiple obstacles and
fixed inclines (flat rectangular planks) placed on level ground. Positions in the arena are
given using $x$, $y$, and $z$ coordinates (in meters). Orientation in the $x$-$y$ plane is given by
the yaw angle varying from $-180$ to $180$ degrees, increasing in the counter-clockwise direction
with 0 aligned with the $y$-axis. The initial position and orientation of the robot is also a part
of the environment.

3.1  Constrained  Parametrized  Tests 

One of the fundamental notions for auto-grading is that of a test.

Definition 5 (Test) A pair $(E, \varphi)$ of an environment $E$ and an STL formula $\varphi$ is called a
test. A test passes for a controller $C$ if and only if $\mathrm{sim}(C, E) \models \varphi$.

Note that our definition of a test is different from the more common definition because
in addition to controller inputs (provided in the form of an environment), it also contains an
assertion specified via STL.

For the end-to-end correctness property (goal), we will employ the convention that the
STL formula $\varphi$ in a test for this goal is the negation of the property that we want to hold.
In  other  words,  if  a  test  “passes,”  it  actually  means  that  the  correctness  property  did  not 
hold  for  that  test  case.  The  reason  for  this  convention  is  that  it  allows  us  to  treat  STL 
formulas  encoding  correctness  goals  and  fault  symptoms  in  a  symmetric  fashion,  something 
that  is  required  for  the  main  technical  results  of  this  paper.  Hereafter  we  will  treat  the  STL 
property  as  specifying  a  fault  unless  explicitly  stated  otherwise. 

Example 1 Consider an environment $E_0$ with a square obstacle occupying the region
$[4.5, 5.5] \times [5.0, 5.5]$. The initial position of the robot is $(5.0, 4.9)$ and the initial
orientation is 0. Consider the STL property $\varphi = \Box(\mathit{pos.y} < 5.5)$, which states that the robot is
never able to reach a point with y coordinate more than 5.5. If the test $(E_0, \varphi)$ passes, we
can assert that the robot did not meet the goal of being able to avoid the obstacle.

Consider a vector of symbolic parameters $p = (p_1, p_2, \ldots, p_n)$. A valuation function $\nu$
maps each symbolic parameter to a concrete value (for example, in $\mathbb{R}^n$) and $\nu(p_i)$ denotes
the value of parameter $p_i$ in $\nu$. The set of all possible valuations of $p$, its domain, is $\mathcal{U}$.

Definition 6 (Parametrized Test) A parametrized environment is an environment with
unknown parameters, denoted $E(p)$. A parametrized test $\Gamma(p) = (E(p), \varphi(p))$ is a pair of
a parametrized environment $E(p)$ and a PSTL formula $\varphi(p)$. Given any valuation $\nu \in \mathcal{U}$,
$\Gamma(\nu(p)) = (E(\nu(p)), \varphi(\nu(p)))$ is a concrete test.

Example 2 Consider the same environment $E_0$ from Example 1 except that the initial
orientation of the robot is an unknown parameter $\theta_{init}$ that can take one of two possible
values $\{-45, 45\}$. (See Figure 3.1a.) Consider the PSTL property $\varphi_0(\pi) = \Box(\mathit{pos.y} >
5.5 \Rightarrow \pi_l < \mathit{pos.x} < \pi_u)$, where $\pi = (\pi_l, \pi_u)$, with unknown parameters $\pi_l$ and $\pi_u$ that can
take one of three possible values $\{-\infty, 5.0, \infty\}$ each. The property states that if the robot is
able to get around the obstacle and reach a point $\mathit{pos.y} > 5.5$, then $\mathit{pos.x}$ is always in the
interval $(\pi_l, \pi_u)$. The pair $\Gamma_0(\theta_{init}, \pi) = (E_0(\theta_{init}), \varphi_0(\pi))$ is an example of a parameterized
test.

Definition 7 (Satisfaction Region) The satisfaction region $\Omega(C, \Gamma(p))$ of a controller $C$
on a parametrized test $\Gamma(p)$ is the set of all valuations $\nu$ of $p$ such that $\Gamma(\nu(p))$ passes for
$C$, i.e., $\Omega(C, \Gamma(p)) = \{\nu \in \mathcal{U} \mid \Gamma(\nu(p)) \text{ passes for } C\}$.

Definition 8 (Test Bench) Given a parameterized test $\Gamma(p)$ and a set of valuations $\rho \subseteq \mathcal{U}$,
the pair $(\Gamma(p), \rho)$ is called a constrained parametrized test, simply referred to as test bench.
The set of valuations $\rho$ is called the sub-domain of the test bench. We say that test bench
$(\Gamma(p), \rho)$ succeeds for a controller $C$ iff there exists a $\nu \in \rho$ such that $\Gamma(\nu(p))$ passes for $C$
or equivalently, $\Omega(C, \Gamma(p)) \cap \rho$ is non-empty.
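As an illustration of Definitions 7 and 8, the following is a minimal Python sketch, not CPSGrader's actual implementation. It assumes a hypothetical callable `passes(controller, valuation)` that stands in for running the simulator and checking the STL formula. The toy controller model (a map from initial orientation to the x positions visited after clearing the obstacle) is also an assumption made only for this example.

```python
from itertools import product

def satisfaction_region(controller, domain, passes):
    """All valuations in `domain` whose concrete test passes for `controller`."""
    return {v for v in domain if passes(controller, v)}

def bench_succeeds(controller, sub_domain, passes):
    """A test bench succeeds iff some valuation in its sub-domain passes."""
    return any(passes(controller, v) for v in sub_domain)

# Toy stand-in for Examples 2 and 3: a valuation is (theta_init, pi_l, pi_u).
INF = float("inf")
domain = set(product([-45, 45], [-INF, 5.0, INF], [-INF, 5.0, INF]))

def toy_passes(controller, valuation):
    # Hypothetical model: controller[theta] lists pos.x values seen while
    # pos.y > 5.5 (an empty list means the obstacle was never cleared).
    theta, lo, hi = valuation
    return all(lo < x < hi for x in controller[theta])

# A controller that drifts right of x = 5.0 whatever its initial orientation.
drifts_right = {-45: [5.8, 6.1], 45: [5.4, 5.9]}
rho_0 = {(45, 5.0, INF), (-45, -INF, 5.0)}
print(bench_succeeds(drifts_right, rho_0, toy_passes))
```

Here the bench succeeds because the valuation with orientation 45 passes: the robot ends up on the wrong side of the obstacle, which is the symptom the sub-domain was chosen to capture.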

Since  a  test  bench  typically  includes  both  the  goal  properties  (determining  whether  a 
student  controller  is  correct  or  not)  and  the  fault  properties  (determining  the  mistakes  the 
student  made),  the  crux  of  the  auto-grading  problem  is  to  synthesize  a  test  bench  that  can 
accurately classify an “unlabeled” controller as correct/incorrect and with the fault(s), if
any. Treating goal and fault properties uniformly, we seek to synthesize a test bench to
classify  whether  an  unlabeled  controller  exhibits  faulty  behaviors. 

To  auto-grade,  for  every  known  fault,  we  create  a  test  bench.  If  the  test  bench  succeeds 
for  an  unlabeled  controller,  we  can  conclusively  label  it  as  one  exhibiting  faulty  behavior. 
The  sub-domain  of  a  test  bench  essentially  identifies  the  set  of  tests  that  indicate  the 
presence  of  the  fault.  As  mentioned  earlier,  a  test  bench  can  also  be  used  in  a  similar  way 
to  check  if  a  given  controller  meets  goal  requirements  by  formulating  the  failure  to  meet  the 
goal  as  a  fault. 

Example 3 Consider the parameterized test $\Gamma_0$ from Example 2. Consider the sub-domain
$\rho_0 = \{[\theta_{init} \mapsto 45, \pi \mapsto (5.0, \infty)], [\theta_{init} \mapsto -45, \pi \mapsto (-\infty, 5.0)]\}$. For a controller, if either
of the valuations in $\rho_0$ leads to a test that passes, it provides good evidence that the robot is
either unable to avoid the obstacle or it is not able to proceed in the initial direction. (See
Figure 3.1a.) So the test bench $(\Gamma_0(\theta_{init}, \pi), \rho_0)$ can be used to capture this failure to meet
desired goals.

Example 4 Consider an environment $E_1$ with a fixed incline s.t. the uphill direction
is along the orientation 0. The initial location of the robot is fixed at the center of
the bottom boundary of the incline. The initial orientation of the robot is a parameter
$\theta_{init} \in [-180, 180]$. We wish to determine whether a given controller (in an initial
orientation pointing towards the incline) fails to climb within reasonable time. This can be
expressed via the STL property $\varphi_1(h, \tau) = \Box_{[0,\tau]}(\mathit{pos.z} < h)$, which states that the robot is
not able to reach the height $h$ within time $\tau$. The parametrized test bench $\Gamma_1(\theta_{init}, h, \tau) =
(E_1(\theta_{init}), \varphi_1(h, \tau))$, combined with the sub-domain $\rho_1 = \{[\theta_{init} \mapsto \nu_{\theta_{init}}, h \mapsto \nu_h, \tau \mapsto
\nu_\tau] \text{ s.t. } |\nu_{\theta_{init}}| < 90 \wedge \nu_\tau > 60 \wedge \nu_h < 0.4\}$, can reliably capture the failure to climb to a height
above 0.4 m within 60 secs for some initial orientation pointing towards the hill.
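To make the property of Example 4 concrete, here is a small Python sketch that evaluates the bounded "always" formula on a sampled trace. It assumes the trace is a list of `(time, z)` samples and uses a simple discrete-time reading of the formula. The function name and the trace are hypothetical and are not taken from the simulator used in this work.

```python
def never_reaches_height(trace, h, tau):
    """Check G_[0,tau](pos.z < h) on a sampled trace of (time, z) pairs."""
    return all(z < h for t, z in trace if 0 <= t <= tau)

# Hypothetical trace: the robot climbs 1 cm per second for 100 seconds.
slow_climb = [(t, 0.01 * t) for t in range(101)]

print(never_reaches_height(slow_climb, h=0.4, tau=30))   # still below 0.4 m
print(never_reaches_height(slow_climb, h=0.4, tau=60))   # 0.4 m reached by t = 60
```

With this trace, the formula holds for the shorter horizon and fails for the longer one, so the fault symptom is exhibited only for small enough values of the time parameter.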

3.2  Synthesis  of  Test  Bench  Constraints 

Designing  a  test  bench  for  a  fault  involves  (i)  creating  a  parametrized  test  bench,  and 
(ii)  finding  a  sub-domain  of  the  parameters  such  that  it  reliably  captures  the  fault.  While 
creating  a  parametrized  test  bench  by  hand  is  easy,  in  our  experience  manually  coming  up 
with  the  sub-domain  is  tedious.  It  not  only  requires  the  instructor  to  be  a  relative  expert 
in  STL  and  run-time  verification,  but  also  requires  careful  observation  of  traces  where  the 
fault  is  known  to  be  present  and  not  present,  and  a  number  of  iterations  of  trial  and  error. 
On the other hand, instructors can easily come up with a set of reference controllers: a set
$\mathcal{C}^+$ of positive-labeled controllers that are all known to exhibit the faulty behavior, and a
set $\mathcal{C}^-$ of negative-labeled controllers that are all known to not exhibit the faulty behavior.

We define below the problem of synthesizing a sub-domain from a set $\mathcal{C}^+$ of positive-labeled
controllers and a set $\mathcal{C}^-$ of negative-labeled controllers.

Problem 1 Given the following: (1) a parameterized test $\Gamma(p)$ with a domain $\mathcal{U}$ for
parameters $p$, and (2) two sets $\mathcal{C}^+$ and $\mathcal{C}^-$ of controllers. Synthesize a sub-domain $\rho \subseteq \mathcal{U}$
s.t. test bench $(\Gamma(p), \rho)$ does not succeed for any $C \in \mathcal{C}^-$ and succeeds for all $C \in \mathcal{C}^+$.

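We can see that any sub-domain that does not intersect with $\Omega(C, \Gamma(p))$ for any $C \in \mathcal{C}^-$
and has a non-empty intersection with $\Omega(C, \Gamma(p))$ for every $C \in \mathcal{C}^+$ satisfies the
requirements in Problem 1. From amongst all these possibilities, we choose the following (also
illustrated in Figure 3.1b):

$$\rho = \bigcup_{C \in \mathcal{C}^+} \Omega(C, \Gamma(p)) \;\setminus\; \bigcup_{C \in \mathcal{C}^-} \Omega(C, \Gamma(p)) \qquad (3.1)$$

For convenience, we use $\Omega(\mathcal{C}^+, \Gamma(p))$ (and $\Omega(\mathcal{C}^-, \Gamma(p))$) to refer to $\bigcup_{C \in \mathcal{C}^+} \Omega(C, \Gamma(p))$ (and
$\bigcup_{C \in \mathcal{C}^-} \Omega(C, \Gamma(p))$). The rationale behind this choice of $\rho$ is two-fold: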

1. To increase coverage of fault detection for unlabeled controllers, we wish to include as
much of $\Omega(\mathcal{C}^+, \Gamma(p))$ in $\rho$ as possible because every parameter valuation in that set
corresponds to a test that passed on some positively-labeled controller, i.e. a controller
that exhibits the faulty behavior.

2. For the tests corresponding to valuations that are not in either one of $\Omega(\mathcal{C}^+, \Gamma(p))$ or
$\Omega(\mathcal{C}^-, \Gamma(p))$, we choose a lenient grading route and do not include them in $\rho$. This
means that if an unlabeled controller does not pass on any test that lies in $\Omega(\mathcal{C}^+, \Gamma(p))$,
it will not be labeled as one exhibiting the fault. This is how instructors often grade labs
in practice, i.e., if tests conclude that a solution may or may not be faulty, it is considered
to be non-faulty, pending a more detailed manual review. Here we are also assuming that
we have a good range of positive and negative labeled controllers that cover a wide variety
of ways in which the fault may or may not be exhibited.

To generate $\rho$ as in Eqn. 3.1, we compute $\Omega$, as discussed next.
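Eqn. 3.1 amounts to a set difference of two unions. The sketch below is a minimal illustration that assumes each satisfaction region is already available as a Python set of valuations. The function name and the integer valuations are hypothetical.

```python
def synthesize_sub_domain(regions_pos, regions_neg):
    """Eqn. 3.1: union of positive regions minus union of negative regions."""
    omega_pos = set().union(*regions_pos)
    omega_neg = set().union(*regions_neg)
    return omega_pos - omega_neg

# Hypothetical satisfaction regions over a small integer-indexed domain.
omega_c1, omega_c2 = {1, 2, 3}, {3, 4}   # positive example controllers
omega_c3, omega_c4 = {3}, {5}            # negative example controllers
rho = synthesize_sub_domain([omega_c1, omega_c2], [omega_c3, omega_c4])

# The bench succeeds for a positive example only if its region still meets rho.
covers_all_positives = all(rho & region for region in (omega_c1, omega_c2))
```

By construction the result never intersects a negative region. The last line checks the other half of Problem 1 on this data: that each positive example keeps at least one valuation in the sub-domain.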

Figure 3.1: (a) Environment $E_0$ from Examples 1, 2, and 3 with robot R and obstacle O.
The two trajectories shown by dotted lines meet the goals for the cases $\theta = 45$ and $\theta = -45$.
(b) The hatched region is the sub-domain $\rho$ obtained from satisfaction regions of positive
and negative controller examples ($C_1$, $C_2$: positive example controllers; $C_3$, $C_4$: negative
example controllers).


3.3  Computing  the  Satisfaction  Region 


Given a controller $C$ and a parametrized test $\Gamma(p)$ with $p = (p_1, p_2, \ldots, p_k)$, we wish to
compute $\Omega(C, \Gamma(p))$. We assume that all parameters are numerical. Every parameter that is
not finite valued is discretized by sampling uniformly at some granularity within reasonable
lower and upper bounds. By this construction, the domain $\mathcal{U}$ is now a finite $k$-dimensional
array and can be written as a Cartesian product of finite sets $\mathcal{U}_1 \times \mathcal{U}_2 \times \cdots \times \mathcal{U}_k$, where $p_i$
takes values in the set $\mathcal{U}_i$. We assume some indexing on each $\mathcal{U}_i$ such that $\mathcal{U}[j_1, j_2, \ldots, j_k]$
refers to the element of $\mathcal{U}$ formed by picking the $j_i$-th element from each $\mathcal{U}_i$. Moreover,
we assume that this indexing is consistent with the natural order defined over each $\mathcal{U}_i$
(i.e., a lower index implies a smaller value). Let $N = \max_i(|\mathcal{U}_i|)$. The size of $\mathcal{U}$ is $O(N^k)$.
Given this representation of $\mathcal{U}$, $\Omega(C, \Gamma(p))$ can be represented by a $k$-dimensional bit-array,
such that $\Omega(C, \Gamma(p))[j_1, j_2, \ldots, j_k] = 1$ iff the test $\Gamma(\mathcal{U}[j_1, j_2, \ldots, j_k](p))$ passes on $C$.
The most naive way to compute $\Omega(C, \Gamma(p))$ is to perform the test $\Gamma(\nu(p))$ for every
valuation $\nu \in \mathcal{U}$. We describe a more efficient approach to do this in cases where the test
bench is monotonic in one or more parameters.

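Definition 9 (Monotonicity) Given an order $\preceq$ on a parameter $p_i$ in the parameter vector
$p = (p_1, p_2, \ldots, p_k)$, a parameterized test $\Gamma(p)$ is monotonic in $p_i$ if for every controller $C$,

$$\forall \nu, \nu' \cdot \big(\nu(p_i) \preceq \nu'(p_i) \,\wedge\, \forall j \neq i \cdot \nu'(p_j) = \nu(p_j)\big) \Rightarrow \big(\Gamma(\nu(p)) \text{ passes for } C \Rightarrow \Gamma(\nu'(p)) \text{ passes for } C\big) \qquad (3.2)$$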

Example 5 Consider the parameterized test $\Gamma_1(\theta_{init}, h, \tau)$ from Example 4. Consider the
order $\leq$ over $h$ and two values $\nu_h \leq \nu'_h$. For any controller $C$, if $\Gamma_1(\nu_{\theta_{init}}, \nu_h, \nu_\tau)$ passes, it
means that $\mathit{pos.z}$ always stays below $\nu_h$ for the time interval $[0, \nu_\tau]$, which implies that it
stays below $\nu'_h$ as well and hence $\Gamma_1(\nu_{\theta_{init}}, \nu'_h, \nu_\tau)$ will pass. Thus $\Gamma_1(\theta_{init}, h, \tau)$ is monotonic
in $h$.

Similarly, for the order $\geq$ on the parameter $\tau$ and two values $\nu_\tau \geq \nu'_\tau$, if a test
$\Gamma_1(\nu_{\theta_{init}}, \nu_h, \nu_\tau)$ passes for any controller, it means that $\mathit{pos.z}$ always stays below $\nu_h$
for the time interval $[0, \nu_\tau]$, which implies that the same is true for the time interval $[0, \nu'_\tau]$
and hence the test $\Gamma_1(\nu_{\theta_{init}}, \nu_h, \nu'_\tau)$ will also pass.

We can extend the definition of monotonicity to sets of parameters by defining required
orders on tuples of parameter values. For example, $\Gamma_1(\theta_{init}, h, \tau)$ is monotonic in $(h, \tau)$ if
we consider $\preceq$ as the order, where $(\nu_h, \nu_\tau) \preceq (\nu'_h, \nu'_\tau)$ iff $\nu_h \leq \nu'_h$ and $\nu_\tau \geq \nu'_\tau$. Note that we
do not need separate monotonically increasing and decreasing parameterized tests since we
can always invert the order on the parameter and keep the definition consistent.

Note that the definition of monotonicity allows a parameterized test to be monotonic
in environment parameters, but so far in practice we have never encountered a case where
this happens. Checking that a parameterized test is monotonic in certain parameters that
only  occur  in  the  PSTL  part  of  the  test  can  be  done  by  reduction  to  satisfiability  modulo 
theories  (SMT)  as  described  in  more  detail  by  Jin  et  al.  [26].  This  is  an  offline  step  carried 
out  at  the  time  of  design  of  a  parameterized  test. 

Definition 10 (Monotone Bit-Array) For two indices $\mathbf{j} = [j_1, j_2, \ldots, j_k]$ and $\mathbf{j}' =
[j'_1, j'_2, \ldots, j'_k]$ of a $k$-dimensional bit-array $A$, we say $\mathbf{j} \leq \mathbf{j}'$ iff $j_1 \leq j'_1, j_2 \leq j'_2, \ldots, j_k \leq j'_k$.
The array $A$ is said to be monotone if for any indices $\mathbf{j}$ and $\mathbf{j}'$ s.t. $\mathbf{j} \leq \mathbf{j}'$, $A[\mathbf{j}] = 1$ implies
that $A[\mathbf{j}'] = 1$.
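Definition 10 can be checked mechanically. The following is a minimal sketch that assumes the bit-array is stored as a Python dict keyed by 0-based index tuples (the text uses 1-based indices). The function name is hypothetical. It relies on the observation that it is enough to compare each cell with its immediate successor along every axis, since any larger index can be reached through a chain of such steps.

```python
from itertools import product

def is_monotone(a, shape):
    """Check Definition 10 for a bit-array stored as a dict: index tuple -> 0/1."""
    for j in product(*(range(n) for n in shape)):
        if a[j] != 1:
            continue
        # A 1-cell must have 1s at its immediate successor along each axis.
        for axis in range(len(shape)):
            if j[axis] + 1 < shape[axis]:
                nxt = j[:axis] + (j[axis] + 1,) + j[axis + 1:]
                if a[nxt] != 1:
                    return False
    return True

shape = (3, 3)
staircase = {j: int(sum(j) >= 2) for j in product(range(3), range(3))}
broken = dict(staircase)
broken[(2, 2)] = 0   # a 0 above existing 1s violates monotonicity
```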

We now describe how monotonicity proves to be a useful property to efficiently compute
$\Omega(C, \Gamma(p))$. First consider the case when $\Gamma(p)$ is monotonic in all $k$ parameters
$p_1, p_2, \ldots, p_k$. Owing to monotonicity, we can index the valuations using their respective
orders s.t. for any controller $C$, the $k$-dimensional bit-array representation of $\Omega(C, \Gamma(p))$ is
monotone. We describe an algorithm to compute $\Omega(C, \Gamma(p))$ in three separate cases.

3.3.1 Case: k = 1

For the single parameter $p_1$ we can perform a binary search within its domain to
determine the index $b$ such that $\Gamma(\mathcal{U}[j_1 = b](p))$ does not pass on $C$ while $\Gamma(\mathcal{U}[j_1 = b + 1](p))$
passes. We would have to perform $O(\log N)$ tests.
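The binary search for the single-parameter case can be sketched as follows. This is an illustration under stated assumptions: indices are 0-based, `test(j)` is a hypothetical callable standing in for running the concrete test at index `j` on the controller, and the returned value corresponds to $b + 1$ in the text's notation.

```python
def first_passing_index(n, test):
    """Smallest index in range(n) where `test` passes, or n if none does.

    Assumes monotonicity: once test(j) is True, it stays True for larger j.
    """
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        if test(mid):
            hi = mid          # boundary is at mid or to its left
        else:
            lo = mid + 1      # boundary is strictly to the right
    return lo

calls = []
def toy_test(j):
    calls.append(j)           # record every (expensive) test that is run
    return j >= 5             # hypothetical: the test first passes at index 5

boundary = first_passing_index(8, toy_test)
```

Every index from `boundary` upward belongs to the satisfaction region, so the whole one-dimensional bit-array follows from a logarithmic number of simulator runs.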

3.3.2 Case: k = 2

For two parameters $p_1$ and $p_2$, say we have the 2-d array of indices $[1 \cdots U] \times [1 \cdots V]$.
We start at the index $(row = 1, col = V)$. At each step we perform the test $\Gamma(\mathcal{U}[j_1 =
row, j_2 = col](p))$ on $C$. If the test passes, we mark the complete column $\Omega(C, \Gamma(p))[j_1 \geq
row, j_2 = col]$ with 1s (we can do this because of monotonicity) and decrement $col$ by 1. If
the test does not pass, we mark the complete row $\Omega(C, \Gamma(p))[j_1 = row, j_2 \leq col]$ with 0s
and increment $row$ by 1. We do this until we have covered the whole array. We would have
to perform $O(\max(U, V)) = O(N)$ tests since we mark out a complete row or column after
every test. Figure 3.2a shows an intermediate step in a run of this algorithm.
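The two-parameter walk can be sketched in Python as below. The sketch uses 0-based indices, so it starts at `(0, v - 1)` rather than `(1, V)`, and it assumes a hypothetical callable `test(row, col)` in place of a simulator run. It is meant as an illustration of the procedure described above, not as the tool's code.

```python
def satisfaction_region_2d(u, v, test):
    """Fill a u-by-v monotone bit-array using at most u + v calls to `test`."""
    omega = [[None] * v for _ in range(u)]
    row, col = 0, v - 1
    while row < u and col >= 0:
        if test(row, col):
            for r in range(row, u):      # the rest of this column passes too
                omega[r][col] = 1
            col -= 1
        else:
            for c in range(col + 1):     # the rest of this row fails too
                omega[row][c] = 0
            row += 1
    return omega

calls = []
def toy_test(row, col):
    calls.append((row, col))
    return row + col >= 3                # hypothetical monotone pass condition

omega = satisfaction_region_2d(4, 5, toy_test)
```

Each call removes one full row or column from consideration, which is why the number of tests grows with $U + V$ rather than $U \cdot V$.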

3.3.3 Case: k ≥ 3

For more than 2 parameters, we enumerate over all possible valuations of the first $k - 2$
parameters and use the case for $k = 2$ for the 2-d sub-array obtained by fixing $p_1, p_2, \ldots, p_{k-2}$.
We would have to perform $O(N^{k-1})$ tests. We cannot hope to do (asymptotically) better
than this as it is shown in [27] that searching in a monotone $d$-dimensional array where
each dimension is of size at most $n$ is lower bounded by $c_2(d)\, n^{d-1}$, where $c_2(d) = O(d^{-1/2})$
for $d \geq 2$.

For the general case, let $\Gamma(p)$ be non-monotonic in the first $k - d$ parameters and
monotonic in the $d$ others. We enumerate over all possibilities of the first $k - d$ parameters
and apply the algorithm for monotonic parameters to the $d$-dimensional sub-array obtained
by fixing $p_1, p_2, \ldots, p_{k-d}$.

Using the above approach, we can compute $\Omega(\mathcal{C}^+, \Gamma(p))$, $\Omega(\mathcal{C}^-, \Gamma(p))$ and $\rho =
\Omega(\mathcal{C}^+, \Gamma(p)) \setminus \Omega(\mathcal{C}^-, \Gamma(p))$, all represented in the form of $k$-dimensional bit-arrays.

3.4  Adequate  Test  Samples  for  Grading 

Checking whether a new controller $C$ succeeds on a test bench $(\Gamma(p), \rho)$ amounts to
searching for a valuation in $\rho$ such that $\Gamma(\nu(p))$ passes for $C$. The naive approach to solve
the search problem is to enumerate all valuations in $\rho$. We describe a more efficient search
strategy when $\Gamma(p)$ is monotonic in one or more parameters.

Definition 11 (Adequate Test Sample) An adequate test sample $\alpha \subseteq \rho$ is a set of
valuations s.t. for any controller $C$, $(\Gamma(p), \rho)$ succeeds on $C$ iff there is at least one $\nu \in \alpha$ for
which $\Gamma(\nu(p))$ passes for $C$.

Definition 12 (Corner) A corner in a monotone $k$-dimensional bit-array $A$ is an index
$\mathbf{j} = [j_1, j_2, \ldots, j_k]$ s.t. $A[\mathbf{j}] = 0$ and $\forall\, 1 \leq l \leq k$, if the index $[j_1, j_2, \ldots, j_l + 1, \ldots, j_k]$ lies
within the bounds of $A$, then $A[j_1, j_2, \ldots, j_l + 1, \ldots, j_k] = 1$.

First consider the case when a parameterized test $\Gamma(p)$ is monotonic in all parameters
$p = (p_1, p_2, \ldots, p_k)$. Say we have computed $\Omega(\mathcal{C}^+, \Gamma(p))$, $\Omega(\mathcal{C}^-, \Gamma(p))$ and $\rho = \Omega(\mathcal{C}^+, \Gamma(p)) \setminus
\Omega(\mathcal{C}^-, \Gamma(p))$ in $k$-dimensional bit-array form.

Proposition 1 The set $\alpha$ comprising all valuations $\mathcal{U}[\mathbf{j}]$ s.t. $\mathbf{j}$ is a corner of $\Omega(\mathcal{C}^-, \Gamma(p))$
and $\Omega(\mathcal{C}^+, \Gamma(p))[\mathbf{j}] = 1$, is a minimal adequate test sample for $(\Gamma(p), \rho)$.

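Proof. We first show that $\alpha$ is adequate, then we show $\alpha$ is also minimal. For this
proof, we refer to $\Omega(\mathcal{C}^+, \Gamma(p))$ by $\Omega^+$ and $\Omega(\mathcal{C}^-, \Gamma(p))$ by $\Omega^-$.

Assume $\Gamma(\nu(p))$ passes for $C$ for some $\nu \in \alpha$. Let the index of this valuation be $\mathbf{j}_\nu$. By
definition of $\alpha$, $\Omega^+[\mathbf{j}_\nu] = 1$ and $\mathbf{j}_\nu$ is a corner of $\Omega^-$, implying $\Omega^-[\mathbf{j}_\nu] = 0$. From the way
we have defined $\rho$, we can say that $\rho[\mathbf{j}_\nu] = 1$ or $\nu \in \rho$, which means $(\Gamma(p), \rho)$ succeeds for
$C$. For the reverse implication, assume $(\Gamma(p), \rho)$ succeeds for $C$. It means that it is possible
to find an index $\mathbf{j} = [j_1, j_2, \ldots, j_k]$ s.t. $\mathcal{U}[\mathbf{j}] \in \rho$ (equivalently, $\rho[\mathbf{j}] = 1$) and $\Gamma(\mathcal{U}[\mathbf{j}](p))$ passes
for $C$ (equivalently, $\Omega(C, \Gamma(p))[\mathbf{j}] = 1$). Since $\mathcal{U}[\mathbf{j}] \in \rho$, we have $\Omega^+[\mathbf{j}] = 1$ and $\Omega^-[\mathbf{j}] = 0$.
If $\mathbf{j}$ is a corner of $\Omega^-$, then we have $\mathcal{U}[\mathbf{j}] \in \alpha$ and we are done. If not, then there exists
$1 \leq l \leq k$, $\mathbf{j}' = [j_1, j_2, \ldots, j_l + 1, \ldots, j_k]$ s.t. $\Omega^-[\mathbf{j}'] = 0$. By monotonicity, we also have
$\Omega^+[\mathbf{j}'] = 1$ and $\Omega(C, \Gamma(p))[\mathbf{j}'] = 1$. If $\mathbf{j}'$ is a corner of $\Omega^-$, then $\mathcal{U}[\mathbf{j}'] \in \alpha$ and we are done.
Else we set $\mathbf{j}$ to $\mathbf{j}'$ and proceed again in the same way. Since $\mathcal{U}$ is finite, this procedure is
guaranteed to terminate at a corner of $\Omega^-$.

To show minimality, we remove some arbitrary valuation $\nu$ from $\alpha$ and show that it
becomes inadequate. Say $\mathbf{j}_\nu$ is the index corresponding to $\nu$. Consider a controller $C$
s.t. $\Omega(C, \Gamma(p))[\mathbf{j}] = 1$ iff $\mathbf{j} \geq \mathbf{j}_\nu$. Since $\mathbf{j}_\nu$ is a corner of $\Omega^-$, for every index $\mathbf{j} \neq \mathbf{j}_\nu$ and $\mathbf{j} \geq \mathbf{j}_\nu$,
we have that $\Omega^-[\mathbf{j}] = 1$. This means there is no corner of $\Omega^-$ in $\Omega(C, \Gamma(p))$ apart from $\mathbf{j}_\nu$.
Hence, we will not be able to find another $\nu' \in \alpha$, $\nu' \neq \nu$ s.t. $\Gamma(\nu'(p))$ passes on $C$, even
though $(\Gamma(p), \rho)$ succeeds on $C$. This means $\alpha$ becomes inadequate if we remove any of its
elements, thus making it minimal. ∎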
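Proposition 1 suggests a direct way to read off the adequate test sample once both bit-arrays are known. The sketch below is a brute-force illustration that scans every cell. It assumes bit-arrays stored as dicts keyed by 0-based index tuples, and all function names are hypothetical. It does not attempt the more efficient boundary-following search.

```python
from itertools import product

def corners(a, shape):
    """Indices j with a[j] == 0 whose in-bounds successors along every axis are 1."""
    found = []
    for j in product(*(range(n) for n in shape)):
        if a[j] != 0:
            continue
        successors = [
            j[:axis] + (j[axis] + 1,) + j[axis + 1:]
            for axis in range(len(shape))
            if j[axis] + 1 < shape[axis]
        ]
        if all(a[s] == 1 for s in successors):
            found.append(j)
    return found

def adequate_sample(omega_pos, omega_neg, shape):
    """Proposition 1: corners of omega_neg at which omega_pos is 1."""
    return [j for j in corners(omega_neg, shape) if omega_pos[j] == 1]

shape = (4, 4)
cells = list(product(range(4), range(4)))
omega_pos = {j: int(sum(j) >= 2) for j in cells}   # hypothetical positive region
omega_neg = {j: int(sum(j) >= 5) for j in cells}   # hypothetical negative region
alpha = adequate_sample(omega_pos, omega_neg, shape)
```

In this toy instance the sub-domain has ten cells, but only the three cells just below the boundary of the negative region need to be tested for a new controller.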

Figure 3.2b shows an example of a minimal adequate test sample for the 2-d case.
To compute $\alpha$, similar to Sec. 3.3: in case $k = 1$, we can do a binary search to find the
corner; in case $k = 2$, we can find corners by starting at the boundary of the
2-d array and eliminating rows and columns; and in case $k \geq 3$, we can enumerate over
the first $k - 2$ parameters and apply the case for $k = 2$ on the rest. For the general case of
$k - d$ non-monotonic and $d$ monotonic parameters, we enumerate over all possibilities of
the first $k - d$ parameters, and keep accumulating the adequate test sample calculated for the
$d$-dimensional monotone sub-array obtained by fixing the first $k - d$ parameters.

Figure 3.2: (a) An intermediate step in a run of the algorithm used to compute $\Omega(C, \Gamma(p))$
for two monotonic parameters $p = (p_1, p_2)$. The arrows indicate the tests that are
performed; each cell label 0/1 is the result of the test performed, 1 iff passed. Monotonicity
allows us to compute the whole of $\Omega(C, \Gamma(p))$ by performing $O(\max(U, V))$ tests. (b) For the
case of two monotonic parameters (increasing in the directions shown by arrows), the dashed
(and dotted) lines represent the boundary between cells containing 0s and 1s for
$\Omega(\mathcal{C}^+, \Gamma(p))$ (and $\Omega(\mathcal{C}^-, \Gamma(p))$). The shaded part is $\rho$. The hatched cells are corners of
$\Omega(\mathcal{C}^-, \Gamma(p))$ and the shaded hatched cells comprise the minimal adequate test sample.

We conclude this section with a remark about an alternative mathematical formulation.
If we treat a monotone bit-array as a partially-ordered set (poset) $D$, then the satisfaction
region $\Omega(C, \Gamma(p))$ of some controller $C$ is an upward-closed subset of $D$. The sub-domain $\rho$
is now the intersection of an upward-closed set ($\Omega^+$) and a downward-closed set (the
complement of $\Omega^-$). With some effort, we can show that the minimal adequate test sample $\alpha$
corresponds to the maximal elements of $\rho$. However, we find the monotone bit-array formulation
more useful for our purposes because it is a special case of a poset that allows for efficient
algorithms (as given in Sec. 3.3.1 and 3.3.2) for computation of $\alpha$, which is not obvious with
the general poset formulation.


3.5  Related  Work 

Parameter  synthesis  for  PSTL  formulas  has  been  studied  before  [23],  [26].  Unlike  our 
work,  these  efforts  seek  to  find  specific  parameter  values  rather  than  sub-domains,  and  are 
not  directly  usable  in  the  auto-grading  context  of  this  paper.  A  symbolic  approach  to  PSTL 
parameter  synthesis  has  been  discussed  in  [23],  which  reports  that  an  enumerative  approach 
outperforms  the  symbolic  one. 

We  also  note  related  work  in  the  area  of  fault  localization  only  using  execution  traces 
(black-box  localization)  [28],  [29].  However,  these  techniques  apply  to  digital  systems  and 
are  not  directly  usable  in  our  context  of  hybrid  systems  with  continuous  variables. 
Chapter 4


Clustering-Based  Active  Learning 


In this chapter, we give the details of an active learning algorithm based on clustering
that we use to select the set of controllers that should be labeled to serve as the training
set. For ease of presentation in this chapter, we make a simplifying assumption that
parameterized tests do not contain any environment parameters. For parameterized tests that
contain environment parameters, we can apply the same technique by enumerating over the
environment parameters. The synthesis algorithm described in Sec. 3.3 is used as a black
box training module Train, which takes as input a parameterized test $\Gamma(p) = (\varphi(p), E)$
(again without environment parameters), and two sets of controllers $\mathcal{C}^+$ and $\mathcal{C}^-$ (positively
and negatively labeled training data), and gives as output a test bench $(\Gamma(p), \rho)$. A
synthesized test bench $(\Gamma(p), \rho)$ (also referred to as a classifier in this chapter) is then used by
a classification module Classify that can label new solutions as being faulty or not. In
other words, given a dataset $\mathcal{D}$ of solutions, Classify will output a partition of $\mathcal{D}$ into two
sets $\mathcal{D}_0$ and $\mathcal{D}_1$ of solutions labeled 0 and 1, corresponding to the fault being absent and
present, respectively.

Generating labeled data for the training module is expensive. An instructor would have
to manually look at the simulation video to determine whether a solution is faulty or not.
This is the problem we tackle here. How can we make generation of training examples easier
and more efficient?

4.1 Iterative Synthesis of Test Benches by Active Learning


Active  Learning  [30]  is  a  form  of  machine  learning  where  the  learning  algorithm  is  able 
to  interactively  query  the  user  to  get  the  correct  labels  for  new  data  points.  Our  problem  fits 
well  within  this  definition.  We  extend  the  training  module  with  another  selection  module 
Select  that  decides  which  new  controller(s)  to  get  a  correct  label  for.  The  overall  active 
learning  procedure  is  iterative.  Algorithm  1  takes  a  dataset  of  solutions,  an  expert  labeling 
oracle  (that  generates  true  labels),  and  a  parameterized  test  corresponding  to  a  fault  as 
input,  and  outputs  a  synthesized  test  bench.  The  algorithm  works  iteratively  by  using 
clustering  to  select  the  controllers  to  be  added  to  the  training  data  and  using  the  synthesis 
procedure  described  in  Sec.  3.3  at  each  step.  The  training  module  first  generates  a  classifier 
based  on  some  sets  of  training  controllers.  Depending  on  the  results  of  the  classifier,  the 
controllers  labeled  as  0  and  1  are  separately  clustered  by  a  clustering  module  Cluster. 
Using  the  clusters  formed,  the  selection  module  Select  chooses  new  controllers  to  get 
correct  labels  for.  All  the  selected  controllers  that  were  incorrectly  labeled  by  the  classifier 
are  now  added  to  the  training  set  and  the  classifier  is  trained  again.  This  continues  until  no 
fresh  training  data  is  added.  Details  of  the  clustering  algorithm  and  the  selection  module 
are  given  in  the  following  sections. 
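The iterative procedure described above can be sketched as a short Python loop. This is an illustration under several assumptions, not the thesis implementation: the four modules are passed in as hypothetical callables, the parameterized test is assumed to be captured inside `train`, the oracle returns `True` for a faulty controller, and the loop follows the stopping rule stated in the prose (stop when a round adds no fresh training data). The toy stand-ins below replace controllers with integers purely so the sketch can run.

```python
def iterative_synthesis(dataset, oracle, train, classify, cluster, select):
    """Active-learning loop: retrain until a round adds no fresh example."""
    positives, negatives = set(), set()
    while True:
        bench = train(positives, negatives)
        labeled_0, labeled_1 = classify(bench, dataset)
        picked_0 = select(cluster(labeled_0))
        picked_1 = select(cluster(labeled_1))
        # Only picks that the classifier got wrong (per the oracle) are added.
        new_pos = {c for c in picked_0 if oracle(c)} - positives
        new_neg = {c for c in picked_1 if not oracle(c)} - negatives
        if not new_pos and not new_neg:
            return bench
        positives |= new_pos
        negatives |= new_neg


# Toy stand-ins (all hypothetical): controllers are integers, the fault is
# "value >= 6", and the classifier is a threshold at the smallest positive.
def toy_train(positives, negatives):
    return min(positives) if positives else float("inf")

def toy_classify(threshold, dataset):
    labeled_1 = {c for c in dataset if c >= threshold}
    return set(dataset) - labeled_1, labeled_1

def toy_cluster(items):
    return [sorted(items)] if items else []

def toy_select(clusters):
    # Deterministic for the sake of the example: both ends of each cluster.
    return {c for cl in clusters for c in (cl[0], cl[-1])}

threshold = iterative_synthesis(range(10), lambda c: c >= 6,
                                toy_train, toy_classify, toy_cluster, toy_select)
```

In the toy run, each round exposes one more misclassified faulty controller until the learned threshold matches the oracle, at which point no fresh example is found and the loop stops.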

4.2  Clustering  with  Precomputed  Distances 

Cluster performs density-based spatial clustering (DBSCAN) on a set of unlabeled
controllers. DBSCAN only takes pairwise distances among the data points as input. There
is no need to specify a feature vector or the number of clusters a priori. We use
multi-dimensional dynamic time warping (DTW) (with point-wise Euclidean distance) as the
measure of distance between two controllers for a given environment. More concretely, say
for a parameterized test $\Gamma(p) = (\varphi(p), E)$, the set of variables that occur in the formula $\varphi$
is $V$. Given a controller $C$, we can obtain a simulation trace $\mathrm{sim}(C, E)$, which is a
multi-dimensional timed sequence. Note, however, that the classification would only depend on the
variables $V$, hence we project out the rest of the variables from the simulation trace $\mathrm{sim}(C, E)$.
We compute the DTW distance on the resulting multi-dimensional timed sequence.

Algorithm 1: IterativeSynthesis

Input: A dataset of student solutions $\mathcal{D}$, a true labeling oracle $O$, and a
       parameterized test $\Gamma(p) = (\varphi(p), E)$ corresponding to some fault
Output: A classifier $(\Gamma(p), \rho)$ for $\mathcal{D}$

$\mathcal{C}^+, \mathcal{C}^- \leftarrow \emptyset$
repeat
    $(\Gamma(p), \rho) \leftarrow$ Train$(\Gamma(p), \mathcal{C}^+, \mathcal{C}^-)$
    $\mathcal{D}_0, \mathcal{D}_1 \leftarrow$ Classify$((\Gamma(p), \rho), \mathcal{D})$
    $\Theta_0, \Theta_1 \leftarrow$ Cluster$(\mathcal{D}_0)$, Cluster$(\mathcal{D}_1)$
    $\mathcal{R}_0, \mathcal{R}_1 \leftarrow$ Select$(\Theta_0)$, Select$(\Theta_1)$
    $\mathcal{C}^+_{new} = \{C \text{ s.t. } (C \in \mathcal{R}_0 \wedge O(C) = \mathit{with\_fault})\}$
    $\mathcal{C}^+ = \mathcal{C}^+ \cup \mathcal{C}^+_{new}$
    $\mathcal{C}^-_{new} = \{C \text{ s.t. } (C \in \mathcal{R}_1 \wedge O(C) = \mathit{without\_fault})\}$
    $\mathcal{C}^- = \mathcal{C}^- \cup \mathcal{C}^-_{new}$
until $\mathcal{C}^+_{new}$ or $\mathcal{C}^-_{new}$ is empty
return $(\Gamma(p), \rho)$
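The distance computation of Sec. 4.2 can be illustrated with a small, dependency-free Python sketch. It assumes each trace is a list of dicts mapping signal names to values at successive time steps. The function names and the trace layout are hypothetical. The dynamic-programming recurrence is the textbook DTW formulation with Euclidean point-wise cost, and may differ in detail from the variant used in the experiments.

```python
import math

def project(trace, variables):
    """Keep only the signals named in `variables` at each time step."""
    return [tuple(sample[name] for name in variables) for sample in trace]

def dtw_distance(seq_a, seq_b):
    """DTW cost between two sequences of points, with Euclidean point distance."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # repeat a point of seq_b
                                 cost[i][j - 1],      # repeat a point of seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Two hypothetical traces that differ only by a one-step delay.
trace_a = [{"pos.x": 5.0, "pos.z": 0.0}, {"pos.x": 5.1, "pos.z": 0.2},
           {"pos.x": 5.2, "pos.z": 0.4}]
trace_b = [trace_a[0]] + trace_a
delay_cost = dtw_distance(project(trace_a, ["pos.z"]), project(trace_b, ["pos.z"]))
```

The resulting pairwise distances can then be collected into a matrix and handed to any DBSCAN implementation that accepts precomputed distances, for instance scikit-learn's `DBSCAN(metric="precomputed")` if that library is available.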

4.3  Selection  of  Training  Data  from  Clusters 

The  selection  module  implements  the  policy  used  for  selecting  data  points  to  be  added 
to  the  training  set.  This  is  done  bearing  in  mind  that  the  training  algorithm  works  well 
if  the  training  data  is  balanced  in  terms  of  number  of  positive  and  negative  examples. 
Training data balancing is a standard technique in machine learning [31]. This is especially
important in our context because some faults are rare, and others are very common (in
non-final versions of solutions), and hence the occurrence of positive and negative examples
is imbalanced.

For  initialization  of  the  training  set  during  the  first  iteration,  we  cluster  all  the  samples 
using the clustering module. We then select a randomly chosen sample from each cluster,
look up its label, and add it to the training set. If the number of samples for positive and
negative  training  is  skewed,  we  continue  picking  more  training  instances  until  either  a 
threshold  upper  bound  is  reached  or  we  are  unable  to  reduce  the  skew  any  further.  To 
reduce  the  skew,  we  randomly  pick  a  cluster  from  which  a  minority  instance  was  obtained 
(positive  or  negative)  and  sample  again  hoping  to  obtain  another  instance  of  the  minority 
class  thereby  reducing  the  skew. 

Once the initialization step is complete, we move on to running the classifier and
obtaining predicted labels on the test set. If the accuracy on the test set is not 100%, we try
to improve our training set by adding samples that were labeled incorrectly.
In order to achieve that, we re-cluster all the samples (test and training) in each class
separately, randomly pick a cluster which is not already represented in the training set
(i.e., the cluster and the training set have no sample in common) and pick a random sample
from it. We do this for both classes and add the respective sample to the training set if
the predicted label was not the same as the actual label. This step is performed in a loop until
we are unable to increase the size of the training set or 100% accuracy has been achieved.
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4.4  Related  Work 


DTW has been previously used for classification of temporal sequences of video, audio,
and graphics data [32], using an algorithm similar to k-nearest neighbours [33]. Active
learning is a popular methodology for cases where obtaining training data is expensive,
using strategies like uncertainty sampling, expected model change, expected error reduction,
etc. [30]. We have not seen past work that applies clustering-based strategies for active
learning.
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Chapter  5 


Evaluation 


The design and initial experimental evaluation of CPSGrader were carried out using a collection
of solutions implemented by 50 groups of students as part of the laboratory component of
the Fall 2013 instance of the EECS 149 class at UC Berkeley.

The code was anonymized and collected automatically using post-build commands, so
that each group provided a variable number of versions, most of which are intermediate
non-final solutions. The lab was organized in two sessions, one focusing on the obstacle
avoidance problem and the other on hill climbing. In this section, we describe
the set of test benches that we used to establish diagnostics with respect to each goal. For
each test bench, we first manually label a set T of 100 randomly selected student solutions.
We then select 30 of the 100 solutions as input to the synthesis algorithm, while maintaining
balance between the number of positive and negative examples. To elaborate,
if we have more than 15 each of positive and negative examples (say 45 positive and 55
negative), then we select some 15 examples of each type arbitrarily. If there are fewer than 15
positive or fewer than 15 negative examples (say 5 positive and 95 negative), then we select all
instances of the type of example that is scarce and select the remainder of the 30 from the
other type (in the example, we take 5 positive and 25 negative). This is a standard
technique in machine learning, used to improve coverage and reduce bias in case a fault is
rare [31]. In Sec. 5.1 and 5.2, for each test bench, we describe (1) the fault symptom and the
corresponding PSTL formula, (2) environment and STL parameters, and their monotonic
nature, (3) synthesized sub-domain and adequate test sample, and (4) synthesis time per
training example.
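
The balanced selection rule described above can be written down in a few lines. The following is a minimal sketch; the function name `select_balanced` and its parameters are hypothetical, and the arbitrary choice within each class is modeled here as seeded random sampling.

```python
import random


def select_balanced(positives, negatives, total=30, rng=None):
    """Select `total` labeled examples, balancing the two classes if possible.

    If one class has fewer than total // 2 examples, all of its examples are
    taken and the remainder is filled from the other class.
    """
    rng = rng or random.Random(0)
    half = total // 2
    n_pos = min(len(positives), half)
    n_neg = min(len(negatives), half)
    if n_pos < half:
        # Positives are scarce: fill the remainder with negatives.
        n_neg = min(len(negatives), total - n_pos)
    elif n_neg < half:
        # Negatives are scarce: fill the remainder with positives.
        n_pos = min(len(positives), total - n_neg)
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)
```

With 45 positive and 55 negative examples this returns 15 of each; with 5 positive and 95 negative examples it returns all 5 positives and 25 negatives, matching the two cases in the text.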

In Sec. 5.3.1, we measure the accuracy of the grader by comparing labels generated by the
auto-grader against another set of manually graded solutions (disjoint from T). We also
demonstrate efficiency in terms of the average grading time per solution.

The auto-grader designed as described above was used in the MOOC version, EECS
149.1x. Since we perform synthesis of test benches based on a training set obtained from
the on-campus version of the course, in Sec. 5.3.2 we evaluate the accuracy of the grader
on a set of student solutions collected from the MOOC to show robustness of the grader on
a new data set that might have different kinds of variations. We also study the correlation
of overall grades assigned by the auto-grader with grades assigned by an expert
manual grader in Sec. 5.4.

In Sec. 5.5, we evaluate the iterative synthesis algorithm Alg. 1 (referred to as ISyn in
the rest of the chapter) by comparing it against the technique Random, where we randomly
choose the training set, and show that ISyn can obtain higher overall accuracy with a
smaller training set.

In Sec. 5.6, we propose a semi-automated methodology for identifying new fault scenarios
using solutions that do not pass the objectives but also do not exhibit any faults in our
library. This methodology is based on clustering of simulation traces.

Experiments are performed using a single core of a 2.3 GHz processor with 8 GB of
memory. Since several tests share the same environment configuration, we run simulations
for all solutions in all the environment configurations needed for our evaluation
in a pre-processing step and store the traces in files. Each simulation is run for 60 s of virtual
time with a step size of 5 ms, which takes about 10 s of system time. For each test bench,
in Sec. 5.1 and 5.2, we report running times of the synthesis algorithm that computes the
sub-domain and the adequate test sample, and in Sec. 5.3.1, we report running times of
the auto-grader, which checks for the existence of a passing test in the adequate test sample.
These running times do not include the time required for simulation since we are reading traces
from files. When using the auto-grader in a loop with the simulator, we need one simulation
for every environment in each test bench per solution (the aggregate is lower in practice
because several test benches share the same environment). All simulations are run
using NI Robotics Simulator. STL monitoring is performed using Breach [13]. The synthesis
modules and grading software, with an extended library of faults, are made available at the
CPSGrader website [34].

5.1  Obstacle  Avoidance 

In assessing faults in obstacle avoidance, we use an environment $E_3(\theta_{init})$ which contains
an obstacle occupying the region $[4.5, 5.5] \times [5.0, 5.5]$. The initial position of the robot is (5.0, 4.9).
The parameter $\theta_{init}$ encodes the initial orientation of the robot.

Failing simple obstacle avoidance (avoid_front)

This test bench checks whether the robot can get past the obstacle when started with
the initial orientation $\theta_{init} = 0$, facing the obstacle directly.

•  Parameterized Test: $(E_3(0), \varphi_{orient})$ with $\varphi_{orient} = \Box_{[0,\tau]}(pos.y < y_{min})$. If $\varphi_{orient}$ is
satisfied for suitable values of $\tau$ and $y_{min}$, it indicates failure to avoid the obstacle.

•  Parameters: $(\tau, y_{min})$

•  Domain:¹ $(\tau, y_{min}) \in \{60 : -5 : 10\} \times \{3.0 : 0.1 : 7.0\}$

•  Monotonicity: $\tau$ monotonic for $>$ and $y_{min}$ monotonic for $<$.

•  Synthesized  sub-domain:  See  Figure  5.1a 

•  Adequate  Test  Sample:  {(60,  5.7),  (55, 4.9),  (50, 4.6)} 

•  Average  synthesis  time  per  training  example:  1.9  sec 
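
To make the role of the adequate test sample concrete, the following sketch checks the property $\Box_{[0,\tau]}(pos.y < y_{min})$ on a sampled trace and reports the fault if any test in the sample is satisfied. It is an illustration only: the thesis uses Breach for STL monitoring, whereas here a trace is assumed to be a plain list of `(t, y)` samples, and the names `always_below` and `fault_detected` are hypothetical.

```python
# Adequate test sample for avoid_front, as (tau, y_min) pairs.
ADEQUATE_TESTS = [(60, 5.7), (55, 4.9), (50, 4.6)]


def always_below(trace, tau, y_min):
    """Discrete-time check of 'always in [0, tau]: pos.y < y_min'.

    trace is a list of (t, y) samples with t in seconds.
    """
    return all(y < y_min for t, y in trace if 0 <= t <= tau)


def fault_detected(trace, tests=ADEQUATE_TESTS):
    """The fault is reported if at least one test in the sample is satisfied."""
    return any(always_below(trace, tau, y_min) for tau, y_min in tests)


# A robot that never leaves its start position y = 4.9 triggers the fault;
# a robot that moves steadily past the obstacle does not.
stuck = [(t, 4.9) for t in range(61)]
passing = [(t, 4.9 + 0.1 * t) for t in range(61)]
```

Checking only the few points of the adequate test sample, rather than the whole sub-domain, is what keeps the per-solution grading time small.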

¹ The notation $\{a : d : b\}$ denotes the set $\{a, a + d, a + 2d, \ldots, a + kd\}$, where $k$ is the greatest integer
such that $a + kd \leq b$ if $d > 0$, and $a + kd \geq b$ if $d < 0$.

Figure 5.1: (a) Test bench avoid_front. Green (lightly shaded) region is the computed
sub-domain. Red (dark shaded) region is the set of tests excluded from the sub-domain because
they are triggered on at least one negative example. White (unshaded) region is the set of
tests that are not triggered on any negative or positive example. Little black squares are
the points in the adequate test sample. (b) Test bench circle.
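
The $\{a : d : b\}$ notation used for parameter domains can be enumerated mechanically. The sketch below assumes that the end point $b$ is included when it is reached exactly, which is suggested by adequate test samples that use $\tau = 10$ from the domain $\{60 : -5 : 10\}$. The helper name `param_range` and the rounding to six digits (to avoid floating-point drift for steps such as 0.1) are assumptions of this example.

```python
from itertools import product


def param_range(a, d, b, ndigits=6):
    """Enumerate {a : d : b} = {a, a + d, ..., a + k*d}, end point inclusive."""
    if d == 0:
        raise ValueError("step d must be non-zero")
    values = []
    k = 0
    while True:
        v = round(a + k * d, ndigits)
        if (d > 0 and v > b) or (d < 0 and v < b):
            break
        values.append(v)
        k += 1
    return values


# Domain of the avoid_front test bench: tau in {60:-5:10}, y_min in {3.0:0.1:7.0}.
taus = param_range(60, -5, 10)
y_mins = param_range(3.0, 0.1, 7.0)
domain = list(product(taus, y_mins))
```

Listing each parameter from its strictest to its most lenient value, as the domains in this chapter do, makes the monotonic ordering of the tests explicit.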

Failing re-orienting after obstacle avoidance (avoid_left/avoid_right)

This test bench checks whether the robot can get past the obstacle and keep heading in
the initial heading direction. We perform the test in two possible initial orientations: facing
left ($\theta_{init} = 45$) or right ($\theta_{init} = -45$). We show details for the case $\theta_{init} = 45$.

•  Parameterized Test: $(E_3(45), \varphi_{reorient})$ with $\varphi_{reorient} = \Box_{[0,\tau]}(pos.y < y_{min} \lor pos.x > x_{max})$.
If $\varphi_{reorient}$ is satisfied for suitable values of $\tau$, $x_{max}$ and $y_{min}$, it indicates either
failure to avoid the obstacle or failure to re-orient in the correct heading direction.

•  Parameters: $(\tau, y_{min}, x_{max})$

•  Domain: $(\tau, y_{min}, x_{max}) \in \{60 : -5 : 10\} \times \{3.0 : 0.1 : 7.0\} \times \{6.0 : -0.1 : 3.0\}$

•  Monotonicity: $\tau$ monotonic for $>$; $y_{min}$ monotonic for $<$ and $x_{max}$ monotonic for $>$.

•  Synthesized sub-domain: Because there are more than two parameters, it cannot be shown in
a figure.

•  Adequate Test Sample: {(60, 5.4, 4.2), (55, 5.4, 5.0), (50, 4.8, 5.8), (10, 4.4, 5.8)}

•  Average synthesis time per training example: 26.2 sec


Strict  equality  check  (circle) 

This test bench investigates the circle fault mentioned in Section 2.3. The purpose of
the test is to detect that at some time instant $t_0$, the robot bumps into the obstacle, then
turns about itself with a maximum period of $\tau$, while remaining close to its position at $t_0$
with a margin of $\delta$.

•  Parameterized Test: $(E_3(0), \varphi_{circle})$

$\varphi_{circle}(t_0, \delta, \tau) = \Diamond(\varphi_{bump}(t_0) \land \Diamond_{[0,2\tau]}(\varphi_{fullturn}(t_0, \delta, \tau)))$

where $\varphi_{bump}(t_0) = (bump(t_0) = \mathrm{TRUE})$ and $\varphi_{fullturn}$ is given by
$\varphi_{fullturn}(t_0, \delta, \tau) = \varphi_{\theta \sim 0} \land \varphi_{close}(t_0, \delta)\, U_{[0,\tau]}\, (\varphi_{\theta \sim 180} \land \varphi_{close}(t_0, \delta)\, U_{[0,\tau]}\, \varphi_{\theta \sim 0})$,
where $\varphi_{close}(t_0, \delta) = (dist(pos(t_0), pos) < \delta)$ for some distance function $dist$, and
$\varphi_{\theta \sim 0}$ and $\varphi_{\theta \sim 180}$ assess that the angle is close to
0 degrees and 180 degrees, respectively. The suitable value for the parameter $t_0$ can
be determined by the first collision instant with the obstacle, which is common to all
solutions since they all start moving forward in the same direction. We fix $t_0$ to this
common value.

•  Parameters: $(\tau, \delta)$

•  Domain: $(\tau, \delta) \in \{1 : 1 : 10\} \times \{-0.025 : 0.01 : 0.2\}$

•  Monotonicity: $\tau$ monotonic for $<$ and $\delta$ monotonic for $<$

•  Synthesized  sub-domain:  See  Figure  5.1b 

•  Adequate  Test  Sample:  {(5.5,  0.195),  (10.0,  0.075)} 

•  Average  synthesis  time  per  training  example:  2.7  sec 

5.2  Hill  Climbing 

To assess faults in the hill climbing part of the assignment, we use an environment $E_4(\beta)$
which contains a hill. The parameter $\beta$ encodes the initial configuration of the robot. It
can take two values, B and M. In B, the robot starts at the bottom of the hill facing 45
degrees rightwards of uphill, and in M, the robot starts on the hill (midway between bottom
and top) facing downhill.

Failing simple hill climb (hill_climb)

This test bench checks whether the robot fails to reach near the top of the hill. We
perform this test for both possible values of $\beta$.

•  Parameterized Test: $(E_4, \varphi_{hill})$ with $\varphi_{hill} = \Box_{[0,\tau]}(pos.z < h)$. If $\varphi_{hill}$ is satisfied for
suitable values of $\tau$ and $h$, it indicates failure to reach near the top of the hill.

•  Parameters: $(\beta, \tau, h)$

•  Domain: $(\beta, \tau, h) \in \{B, M\} \times \{60 : -5 : 10\} \times \{-0.1 : 0.01 : 0.7\}$

•  Monotonicity: $\tau$ monotonic for $>$ and $h$ monotonic for $<$

•  Synthesized  sub-domain:  See  Figure  5.2 

•  Adequate  Test  Sample:  {(M,  55,  0.41),  (M,  50,  0.37),  (M,  35,  0.35),  (M,  15,  0.33),  (M, 
10,  0.31),  (B,  55,  0.45),  (B,  50,  0.34),  (B,  45,  0.18),  (B,  40,  0.07)} 

•  Average  synthesis  time  per  training  example:  6.2  sec 


Figure 5.2: (a) Test bench hill_climb ($\beta = M$) (b) Test bench hill_climb ($\beta = B$)

Figure 5.3: Test bench what_hill


Failure to detect hill (what_hill)

This test bench checks the failure of the robot to detect when it is on a hill. This is a
specific bug which leads to failure in hill climbing. We use the environment $E_4$ with $\beta = B$.

•  Parameterized Test: $(E_4(B), \varphi_{hilldet})$ with $\varphi_{hilldet} = \Diamond_{[0,\tau_1]}(\varphi_{fwd}\, U_{[\tau_2,+\infty]}\, \varphi_{cliff})$, where
$\varphi_{fwd}$ assesses that the robot is moving forward and $\varphi_{cliff}$ assesses firing of the cliff sensor. If
this property is satisfied for suitable values of $\tau_1$ and $\tau_2$, it means that the robot keeps
driving straight until it hits a cliff even if it is on a hill, instead of re-orienting towards
the uphill direction.

•  Parameters: $(\tau_1, \tau_2)$

•  Domain: $(\tau_1, \tau_2) \in \{0 : 1 : 60\} \times \{60 : -1 : 0\}$

•  Monotonicity: $\tau_1$ monotonic for $>$ and $\tau_2$ monotonic for $<$

•  Synthesized  sub-domain:  See  Figure  5.3 

•  Adequate  Test  Sample:  {(1,  0),  (41,  12),  (60,  13)} 

•  Average  synthesis  time  per  training  example:  8.3  sec 
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No  filtering  (filter) 


This test bench checks whether the reason for a failure to climb a hill is the absence
of a low-pass filter applied to the accelerometer data to smooth it. We check this by
performing the test hill_climb with $E_4$ but applying a low-pass filter to the accelerometer
data externally (before it is fed into the controller). If the robot is able to climb the hill
with an external filter but fails to do so without it, we can conclude that absence of the
filter is the bug.
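
As an illustration of what such an external filter might look like, the sketch below applies a first-order low-pass filter (an exponential moving average) to a sequence of accelerometer readings. The text does not specify which filter is used in the test bench, so the filter type, the function name `low_pass`, and the smoothing factor `alpha` are all assumptions of this example.

```python
def low_pass(samples, alpha=0.2):
    """First-order low-pass filter: y[k] = alpha * x[k] + (1 - alpha) * y[k-1].

    A smaller alpha smooths more strongly but reacts more slowly.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    filtered = []
    y = None
    for x in samples:
        # The first output is initialized to the first input sample.
        y = x if y is None else alpha * x + (1 - alpha) * y
        filtered.append(y)
    return filtered


# Noisy readings that alternate between 0 and 10 are pulled towards the mean.
smoothed = low_pass([0, 10, 0, 10], alpha=0.5)
```

In the setting of the filter test bench, the filtered values would be fed to the unmodified student controller in place of the raw readings.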


5.3  Accuracy  of  Classification 

To measure accuracy, we use the synthesized test benches to label a set of student
solutions (disjoint from the training set) and compare the labels assigned by the auto-grader
to manually assigned labels. We evaluate on the sets of solutions collected from both
the on-campus offering of the course and the MOOC version.

5.3.1  On-campus  EECS  149 

Table  5.1  shows  obtained  accuracy  results  and  average  running  times  for  8  test  benches. 
The  running  times  do  not  include  time  needed  for  simulation.  For  each  solution,  simulation 
in  a  total  of  6  environment  configurations  is  collectively  needed  for  the  8  test  benches  (2 
environments  are  shared).  Note  that  we  find  a  majority  of  solutions  that  are  not  able  to 
meet  goals  but  that  is  expected  because  our  solution  set  has  preliminary  and  intermediate 
versions  of  the  solutions  as  well.  We  also  find  that  accuracy  is  poorer  in  the  hill  climbing 
cases,  which  shows  that  variation  in  student  solutions  is  higher  in  that  part  of  the  lab. 

5.3.2  edX  MOOC  EECS  149.1x 

| Test Bench | $N^+$ | $N^{++}$ | $N^-$ | $N^{--}$ | $T_{avg}$ |
|---|---|---|---|---|---|
| avoid_front | 74 | 74 | 27 | 27 | 0.119 |
| avoid_left | 78 | 78 | 23 | 23 | 0.158 |
| avoid_right | 82 | 82 | 19 | 19 | 0.148 |
| circle | 2 | 2 | 99 | 99 | 0.382 |
| hill_climb ($\beta = B$) | 49 | 36 | 345 | 345 | 0.111 |
| hill_climb ($\beta = M$) | 35 | 32 | 359 | 359 | 0.120 |
| what_hill | 220 | 216 | 174 | 156 | 0.288 |
| filter | 8 | 7 | 354 | 339 | 0.412 |

Table 5.1: $N^+$ is the number of solutions with fault (manually labeled). $N^{++}$ is the
number of solutions that the auto-grader correctly labeled as faulty. $N^-$ and $N^{--}$ are
defined similarly for solutions without fault. $T_{avg}$ is the average labeling time per solution
in seconds.

Table 5.2 shows the accuracy results obtained by running CPSGrader on student solutions
collected from the MOOC. Here we find that a majority of solutions meet the goals, and this
is because most solutions are collected from the final assignment submissions. The overall
accuracy is poorer as compared to the on-campus dataset, but that is expected because all
test benches are synthesized using reference solutions chosen from within the on-campus
data set. The test bench filter is excluded from this evaluation because accelerometer filtering
was added as a default in the simulator for the MOOC offering.


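| Test Bench | $N^+$ | $N^{++}$ | $N^-$ | $N^{--}$ |
|---|---|---|---|---|
| avoid_front | 189 | 181 | 1018 | 1014 |
| avoid_left | 172 | 167 | 1035 | 1035 |
| avoid_right | 172 | 169 | 1035 | 960 |
| circle | 10 | 10 | 1197 | 1196 |
| hill_climb ($\beta = B$) | 360 | 304 | 234 | 230 |
| hill_climb ($\beta = M$) | 236 | 175 | 358 | 346 |
| what_hill | 314 | 312 | 280 | 194 |

Table 5.2: Notation is the same as in Table 5.1.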


5.4  Grade  Correlation 

We study how the overall grades assigned by an expert are related to the grades assigned
by CPSGrader on the MOOC data. Overall grades are calculated based on how many
assignment goals (avoid_front, avoid_right, avoid_left, hill_climb) the solution meets and do
not depend on the presence or absence of specific faults (circle, what_hill, filter). The faults are
only meant for feedback and debugging support. We assign 1 point for each goal met, thus
grading on a scale of 0 to 5. Table 5.3 notes the number of solutions that achieved each grade
bar. The correlation coefficient of expert grades versus CPSGrader-assigned grades is found to
be 0.87. These results show that CPSGrader assigns grades that are highly correlated with
expert grades. The grade distribution appears to be slightly skewed towards lower grades
for CPSGrader as compared to expert grades. This is against the intuition that the test
benches in CPSGrader are designed to be lenient, and it needs further investigation.


| Grade Bar | Expert | CPSGrader |
|---|---|---|
| 0 | 14 | 11 |
| 1 | 11 | 12 |
| 2 | 11 | 32 |
| 3 | 182 | 216 |
| 4 | 185 | 209 |
| 5 | 192 | 115 |

Table 5.3: Number of solutions at each grade bar for the Expert grader and CPSGrader.
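
The skew towards lower grades can be quantified directly from the counts in Table 5.3 by comparing the mean grade under each grader. The short sketch below does this; the helper name `mean_grade` is hypothetical. Note that the correlation coefficient of 0.87 cannot be recomputed from these marginal counts alone, since it requires the per-solution pairs of grades.

```python
# Counts from Table 5.3: grade -> (expert count, CPSGrader count).
GRADE_COUNTS = {
    0: (14, 11),
    1: (11, 12),
    2: (11, 32),
    3: (182, 216),
    4: (185, 209),
    5: (192, 115),
}


def mean_grade(column):
    """Mean grade for column 0 (expert) or column 1 (CPSGrader)."""
    total = sum(counts[column] for counts in GRADE_COUNTS.values())
    weighted = sum(g * counts[column] for g, counts in GRADE_COUNTS.items())
    return weighted / total


expert_mean = mean_grade(0)      # about 3.83
cpsgrader_mean = mean_grade(1)   # about 3.59
```

Both columns sum to 595 solutions, and the lower CPSGrader mean is driven mainly by the drop in the number of solutions receiving the top grade.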


5.5  Effectiveness  of  Iterative  Synthesis 

To evaluate the active learning technique developed in Sec. 4.1, we compare our technique
ISyn against the technique Random, where we choose our training set uniformly
at random from the complete dataset. We evaluate the two techniques based on overall
accuracy achieved, the size of training set used, and the balance of training labels. For each
fault, we train the test bench using both ISyn and Random and then test the accuracy
of the obtained test bench on a disjoint set of solutions. To simplify the comparison, we set
both the upper bound on the number of training instances used in ISyn and the total number of
randomly chosen samples in Random to 30. In some cases, ISyn may terminate with fewer
than 30 examples in the training set if the clustering algorithm is not able to find enough
clusters. To compare accuracy, we note the True Positive Rate (TPR), True Negative
Rate (TNR), and F-score for both techniques. The F-score is particularly insightful
for our current application, auto-grading, because the classifier is inherently lenient and a
better classifier would be one that can identify at least a few cases of existence of faults
in the solutions. Analysis results for the 7 distinct faults are shown in Table 5.4. From the
table, it can be seen that the F-score is higher for ISyn than for Random for
all the faults except avoid_left, leading to the conclusion that ISyn leads to better
classification accuracy than Random for a training set of equal or smaller size. It is difficult
to diagnose the reason for the avoid_left exception because the algorithm depends on the fine
tuning of many different parameters of both DTW and DBSCAN. For the fault avoid_right,
we see that ISyn performs significantly better than Random for positive examples but
worse for negative examples. In this case, the reason is that ISyn ends up selecting only
13 training examples (30 being the upper bound) because Cluster cannot find enough
clusters even for a wide range of configuration settings.

As we noted before, training data balancing is important for the training algorithm to
work well. To evaluate how well balanced the training sets obtained using the
two techniques are, we use the balancing ratio, i.e., the ratio of the number of negative training
examples to the number of positive training examples. The closer this value is to 1, the better
balanced the training set is. Table 5.4 gives a clear breakdown of the number of positive and
negative training examples used for each fault per technique. In our evaluation, we find
that this ratio was ~4.3 for Random while it was ~1.2 for ISyn; that is, ISyn leads to more
balanced training sets.

Since ISyn on average performs better than Random on both the accuracy measure
and the balancing measure, we believe that ISyn is a better choice than random sampling
for creating smaller yet more effective training sets.
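
The F-score is not defined explicitly in this chapter. The values reported in Table 5.4 appear to be consistent with the standard F1 score (the harmonic mean of precision and recall), which can be computed from the TPR and TNR counts as sketched below. The function name `f_score` and its argument names are hypothetical.

```python
def f_score(tp, n_pos, tn, n_neg):
    """F1 score from true positives/negatives and the class sizes.

    tp / n_pos is the TPR and tn / n_neg is the TNR as reported in Table 5.4.
    """
    fn = n_pos - tp   # faulty solutions that were missed
    fp = n_neg - tn   # fault-free solutions flagged as faulty
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# circle with ISyn: TPR = 6/7, TNR = 193/200.
circle_isyn = f_score(6, 7, 193, 200)
# avoid_right with Random: TPR = 70/169, TNR = 40/40.
avoid_right_random = f_score(70, 169, 40, 40)
```

Because F1 ignores true negatives, a classifier that never reports a fault scores 0 regardless of its TNR, which is why the measure is informative for a lenient grader.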


5.6  Investigating  Unknown  Faults  Using  Clustering 

| Test Bench | Training Set Size (Random) | Training Set Size (ISyn) | TPR (Random) | TPR (ISyn) | TNR (Random) | TNR (ISyn) | F-score (Random) | F-score (ISyn) |
|---|---|---|---|---|---|---|---|---|
| avoid_front | 23/133 + 7/74 | 15/133 + 15/74 | 133/133 | 133/133 | 67/74 | 70/74 | 0.97 | 0.99 |
| avoid_left | 23/164 + 7/45 | 15/164 + 15/45 | 164/164 | 164/164 | 42/45 | 39/45 | 0.99 | 0.98 |
| circle | 1/7 + 29/200 | 3/7 + 11/200 | 0/7 | 6/7 | 200/200 | 193/200 | 0.00 | 0.60 |
| hill_climb ($\beta = B$) | 26/427 + 4/63 | 14/427 + 16/63 | 427/427 | 427/427 | 55/63 | 60/63 | 0.99 | 1.00 |
| hill_climb ($\beta = M$) | 28/442 + 2/48 | 29/442 + 1/48 | 442/442 | 442/442 | 11/48 | 18/48 | 0.96 | 0.97 |
| avoid_right | 24/169 + 6/40 | 10/169 + 3/40 | 70/169 | 169/169 | 40/40 | 26/40 | 0.59 | 0.96 |

Table 5.4: Comparison of ISyn and Random. Training Set Size denotes the (number of
positive examples selected in the training set)/(total number of positive examples in the data
set) + (number of negative examples selected in the training set)/(total number of negative
examples in the data set). TPR is the true positive rate of the trained classifier. TNR is
the true negative rate.

CPSGrader works with a fixed, pre-defined library of faults and associated test benches.
This raises a natural question: how do we handle the presence of a fault that does not
exist in the library yet? In other words, how do we extend this library in a data-driven
fashion? We attempt to answer these questions partly via semi-automated investigation
of the solutions that do not meet the goals of the assignment and also do not exhibit any
faults in the existing library. We perform this analysis separately for the obstacle avoidance and
hill climbing objectives. For both objectives, we first isolate the set of solutions that
do not meet the goals of the assignment (obstacle avoidance: avoid_right, avoid_front, avoid_left;
hill climbing: hill_climb) and also do not exhibit any faults existing in the library (circle,
what_hill, filter). Then we cluster the simulation traces of this set of solutions (in some
simple default environment that tests the objective) using DBSCAN over pairwise DTW
distances, as described previously for active learning. We then manually analyze the
clusters found by looking at similarities between the simulation traces within a cluster
and also at the source code of the controllers. This leads to several interesting findings, which
we describe next. This analysis was carried out using the data from the on-campus offering.

For the obstacle avoidance objective, we isolated a total of 114 solutions with unknown
faults. DBSCAN forms 4 clusters of size 85, 17, 5, and 5 (2 points were identified as noise).
Investigation of how the minority clusters (17, 5, 5) differed from the majority one leads
to two interesting findings: (1) Symptom: After hitting the obstacle once, the robot drives
away in a direction 90 degrees rightwards of the initial orientation and rams into the wall
on the right. Possible Defect: Presence of an extraneous unguarded transition that switches
from re-orient to the drive mode; (2) Symptom: After hitting the obstacle first, and then
hitting the wall on the right, the robot drives away in a direction 180 degrees from the
initial orientation. Possible Defect: Improper use of the angle sensor while checking for
re-orientation success. The angle reads between -180 and 180, and hence absolute values
should be used when comparing differences. For example, a guard that checks for angle < 0 will
become true both when the angle crosses from 1 to -1 and when it crosses from 179 to -179
degrees. Both these symptoms are easy to characterize as STL formulae.
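
The second defect comes from comparing raw angle readings across the wrap-around at ±180 degrees. A minimal sketch of a wrap-safe comparison is given below; it is not taken from any student controller, and the helper names `angle_diff` and `reoriented` as well as the tolerance `tol` are hypothetical.

```python
def angle_diff(a, b):
    """Smallest signed difference a - b in degrees, mapped into [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0


def reoriented(angle, target, tol=5.0):
    """True if the heading is within tol degrees of the target heading."""
    return abs(angle_diff(angle, target)) <= tol


# The naive guard 'angle < 0' cannot tell these two readings apart:
near_zero = -1.0      # just crossed from 1 to -1
near_flip = -179.0    # just crossed from 179 to -179
```

Here `reoriented(near_zero, 0.0)` holds while `reoriented(near_flip, 0.0)` does not, whereas the guard `angle < 0` is true for both readings.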

For  the  hill  climbing  objective,  we  isolated  a  total  of  174  solutions  with  unknown  faults. 
DBSCAN  forms  3  clusters  of  size  159,  4,  and  4  (7  points  were  identified  as  noise.)  The  traces 
in  the  minority  clusters  are  hard  to  distinguish  from  the  traces  in  the  majority  cluster,  hence 
we  do  not  have  interesting  findings  for  this  case. 
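
The clustering used in this section operates on pairwise DTW distances between simulation traces. As a reminder of what that distance computes, the sketch below gives a textbook dynamic-programming implementation for one-dimensional traces with absolute difference as the local cost. The actual trace representation, local cost, and DTW parameters used for the experiments are not specified here, so these choices and the names `dtw_distance` and `distance_matrix` are assumptions of the example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a[i-1] aligned with an already used b[j-1]
                cost[i][j - 1],      # b[j-1] aligned with an already used a[i-1]
                cost[i - 1][j - 1],  # both sequences advance
            )
    return cost[n][m]


def distance_matrix(traces):
    """Pairwise DTW distances, suitable as input to a clustering algorithm."""
    return [[dtw_distance(p, q) for q in traces] for p in traces]
```

A matrix computed this way can be handed to a density-based clustering routine that accepts precomputed distances, so that traces that differ only in timing end up in the same cluster.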

5.7  Discussion 

The experimental evaluation indicates that CPSGrader is both accurate and efficient.
The test benches used in our evaluation capture common mistakes made by students, as
observed in an on-campus offering, and even simply identifying these mistakes can be valuable
feedback.

In a course survey filled out by students of the edX MOOC EECS 149.1x after completion of
the course, 86% of the students reported that the feedback generated by the auto-grader was
critical in helping them debug and solve the lab exercises. The lab also featured an optional
hardware track. Among the students who chose to work on hardware, more than 90% reported
that their solutions that were developed on the virtual lab (equipped with CPSGrader)
worked on the hardware with no or minor modifications.

The parameter synthesis requires a set of “good” and “bad” solutions. We show that
a small number of labeled examples (30) is enough to get reasonable accuracy in two
different scenarios. However, generation of labeled examples with good coverage of possible
variations in students' solutions requires an instructor to view the simulation video and
label a reasonably large number of student solutions until all major variations are covered.
We show that this process can be made easier with our clustering-based iterative synthesis
approach, achieving better accuracy with fewer training examples.
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Chapter  6 


Conclusion  and  Future  Work 


In  this  thesis,  we  have  formalized  the  auto-grading  problem  for  laboratory  assignments 
in  cyber-physical  systems,  and  presented  a  formal,  algorithmic  approach  to  solve  it  based 
on  parameter  synthesis.  The  approach  is  general  and  can  apply  beyond  the  particular 
motivating  lab  setting  considered  here.  The  theoretical  treatment  makes  no  assumptions 
about  the  form  of  the  controller,  environment,  and  simulation  model.  Note  also  that  our 
approach  can  be  used  with  any  black-box  simulator.  We  also  designed  and  evaluated  a 
clustering-based  active  learning  technique  for  selection  of  labeled  training  examples  for  the 
synthesis  algorithm.  Again,  this  clustering-based  active  learning  approach  is  general  and 
can  apply  to  any  setting  involving  learning  from  time-series  data. 

There are several interesting directions for future work. One direction is to introduce
cost or reward metrics into the model to quantify the quality of a student solution.
Monitoring these metrics over a set of tests can help assign partial credit or extra credit to
student solutions. For example, in a problem involving robot navigation to a goal location,
a controller that gets closer, or takes less time, should intuitively receive more credit than
one that does not. Another direction is to develop STL mining based methods for
synthesizing the form of the STL formulae in test benches. More interesting would be to extend
the work on identifying unknown faults by developing a general approach to synthesize test
benches directly from unlabeled examples of student solutions in an unsupervised way.
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As  mentioned,  the  auto-grader  has  already  been  successfully  deployed  in  an  actual 
MOOC, EECS149.1x [7], and we have run user studies on its effectiveness. In a course
survey filled out by students of the edX MOOC EECS 149.1x after completion of the course,
86% of the students reported that the feedback generated by the auto-grader was critical in
helping them debug and solve the lab exercises. The lab also featured an optional hardware track.
Among the students who chose to work on hardware, more than 90% reported that their
solutions that were developed on the virtual lab (equipped with CPSGrader) worked on the
hardware with no or minor modifications.

We are exploring many avenues to use CPSGrader in other classes and labs. One
interesting topic is analog and mixed signal circuits, for which Time Frequency Logic (TFL [35])
could be used instead of STL.

Finally, beyond the application to education, we note that our technique can be applied
to debugging problems for embedded controllers where we can assume a plausible fault
model and where monotonicity holds, e.g., for industrial control systems where monotonicity
of PSTL has already been found to be widespread [26].
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Appendix  A 


STL  Semantics 


The  formal  semantics  of  signal  temporal  logic  (STL)  are  given  as  follows: 

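Definition 13 The satisfaction of an STL formula relative to a signal $x$ at time $t$ is defined
inductively as

$$
\begin{array}{lll}
(x, t) \models \mu & \text{iff} & x \text{ satisfies } \mu \text{ at time } t \\
(x, t) \models \neg\varphi & \text{iff} & (x, t) \not\models \varphi \\
(x, t) \models \varphi_1 \land \varphi_2 & \text{iff} & (x, t) \models \varphi_1 \text{ and } (x, t) \models \varphi_2 \\
(x, t) \models \varphi_1\, U_{[a,b]}\, \varphi_2 & \text{iff} & \exists t' \in [t + a, t + b] \text{ s.t. } (x, t') \models \varphi_2 \text{ and } \forall t'' \in [t + a, t'),\ (x, t'') \models \varphi_1
\end{array}
$$

Extension of the above semantics to other kinds of intervals (open, open-closed, and
closed-open) is straightforward. We write $x \models \varphi$ as a shorthand of $(x, 0) \models \varphi$.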
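
The inductive definition translates almost line by line into a recursive checker over a sampled signal. The sketch below is a discrete-time approximation for illustration only (the experiments in this thesis use Breach): formulas are encoded as nested tuples, the signal is given by parallel lists of increasing time stamps and values, and all names (`pred`, `neg`, `conj`, `until`, `sat`) are hypothetical. The operators `eventually` and `always` are the usual derived abbreviations and are not part of the definition above.

```python
def pred(f):
    return ("pred", f)

def neg(p):
    return ("not", p)

def conj(p, q):
    return ("and", p, q)

def until(a, b, p, q):
    return ("until", a, b, p, q)

TRUE = pred(lambda v: True)

def eventually(a, b, p):
    # Standard abbreviation: eventually p = TRUE until p.
    return until(a, b, TRUE, p)

def always(a, b, p):
    # Standard abbreviation: always p = not eventually not p.
    return neg(eventually(a, b, neg(p)))

def sat(formula, times, signal, i=0):
    """Does the sampled signal satisfy `formula` at sample index i?"""
    op = formula[0]
    if op == "pred":
        return formula[1](signal[i])
    if op == "not":
        return not sat(formula[1], times, signal, i)
    if op == "and":
        return (sat(formula[1], times, signal, i)
                and sat(formula[2], times, signal, i))
    if op == "until":
        _, a, b, phi1, phi2 = formula
        lo, hi = times[i] + a, times[i] + b
        # Sample indices whose time stamps fall in [t + a, t + b].
        window = [k for k, t in enumerate(times) if lo <= t <= hi]
        for j in window:
            # phi2 holds at t', and phi1 holds at every sample in [t + a, t').
            if sat(phi2, times, signal, j) and all(
                sat(phi1, times, signal, k) for k in window if k < j
            ):
                return True
        return False
    raise ValueError("unknown operator: %r" % (op,))
```

For example, on the signal with values 0, 1, 2, 3, 4 at times 0 to 4, the formula `always(0, 2, pred(lambda v: v < 3))` is satisfied at time 0, while the same formula over the interval [0, 4] is not.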
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